FOREWORD


The U.S. Army Research Institute is examining distance learning
technologies for use by soldiers in an “on demand” environment. The research under the
WEBTRAIN  project,  sponsored  by  the  Training  and  Doctrine  Command  (TRADOC), 
seeks  to  provide  guidance  to  the  U.S.  Army  as  it  transforms  from  a  classroom-centric 
method  of  instruction  to  one  that  is  more  soldier-centric  and  asynchronous. 

An  important  part  of  learning  is  asking  questions.  How  this  can  be  accomplished 
in  learning  environments  that  are  distributed  over  time  and  distance  is  a  vital  concern  to 
course  designers  and  distance  learning  providers.  As  the  Army  shifts  its  focus  to  an 
anytime,  anywhere  training  paradigm,  the  learner’s  need  to  pose  specific  questions  and 
receive  accurate  feedback  will  persist.  The  potential  to  automate  such  a  function,  in  the 
absence  of  live  instructors,  needs  to  be  examined.  The  methods  and  results  of  research 
from a range of disciplines (psychology, cognitive science, linguistics, and information
system design) offer application areas for advanced research in Army training settings.
These  application  areas  were  presented  to  the  Training  Development  and  Analysis 
Directorate,  TRADOC,  on  12  July  2001. 
Question Generation as a Learning Multiplier in Distributed Learning Environments
EXECUTIVE SUMMARY


Research  Requirement: 

As  the  Army  transforms  its  training  enterprise  to  a  model  that  relies  more  on  learner- 
centric  instruction,  soldiers  will  assume  increased  responsibility  for  acquiring  knowledge 
and  developing  skills.  Questions  by  learners  arise  as  a  natural  part  of  the  learning 
process.  Questions  indicate  an  individual’s  lack  of  specific  knowledge,  an  inability  to 
comprehend,  or  a  need  to  construct  meaning.  Responsive  answers  to  well-structured 
questions  can  have  a  powerful  effect  on  learning.  Unfortunately,  anytime-anywhere 
instruction,  particularly  asynchronous  instruction,  may  lack  the  mechanism  for  students 
to  get  a  timely  response  to  a  spontaneous  question.  An  examination  of  the  research 
literature  on  question  generation  and  ideas  for  incorporating  best  practices  into  the  future 
Army  learning  architecture  are  needed. 

Procedure: 

A  review  of  the  literature  on  question  generation  as  an  important  learning  mechanism 
from  both  a  cognitive  and  information  sciences  perspective  was  conducted.  Then, 
empirical  studies  of  question  generation  in  different  learning  environments  using  various 
information  technologies  were  examined.  How  the  benefits  of  question  generation  may 
be  applied  to  distributed  learning  in  military  training  environments  was  then  assessed, 
with  a  specific  objective  to  identify  application  areas  for  implementation. 

Findings: 

More  than  100  documents  were  reviewed.  There  was  ample  evidence  that  training 
students  to  ask  good  questions  can  improve  the  comprehension  and  learning  of  technical 
materials.  Based  on  the  review,  nine  application  areas  for  implementing  the  research  into 
productive  practices  were  identified. 


Utilization  of  Findings: 

This report is relevant to planners and training developers charting the future course of
distributed  learning  and  to  software  designers  dealing  with  the  challenge  of  incorporating 
question  generation  into  advanced  distributed  learning  systems.  Two  of  the  practice 
areas  identified,  presenting  examples  of  good  questions  and  having  groups  generate 
questions  collaboratively,  are  already  being  pursued  by  the  U.S.  Army  Research  Institute 
with  U.S.  Army  Training  and  Doctrine  Command  as  part  of  an  effort  called  Army 
TEAMThink. 
Question Generation as a Learning Multiplier in Distributed Learning Environments

Introduction 

In  tomorrow’s  dynamic  threat  environment,  Army  forces  may  have  to  deploy  on  short 
notice, to initiate operations that cannot be adequately predicted, and to conduct operations that
have  not  been  practiced.  Future  staffs  must  be  able  to  interact  with  globally  distributed  expertise. 
Future  forces  must  be  highly  adaptive  and  be  able  to  reorganize  quickly  to  meet  threats 
effectively.  Soldiers  must  continually  learn,  and  rehearse,  whether  in  school,  at  home  station,  at 
home,  or  in  the  theater  of  operations.  Providing  instruction  on  an  anytime-anywhere  basis  is  a 
key  to  improving  performance  and  maintaining  military  readiness. 

The  Army  Distance  Learning  Program  (TADLP),  initiated  in  1996,  has  the  vision  to 
deliver  individual,  collective,  and  self-development  training  to  soldiers  and  units  anytime  and 
anywhere  through  the  application  of  multiple  means  and  technologies  (TRADOC,  1999).  More 
than  500  courses  are  programmed  for  distance  learning  redesign.  Through  this  transformed 
content,  soldiers  can  utilize  self-paced  distance  learning  modules  delivered  at  their  home  station, 
at  their  workplace,  or  in  their  own  residence.  Soldiers  will  have  an  increased  responsibility  for 
acquiring  and  maintaining  skills,  often  in  the  absence  of  a  face-to-face  instructor.  Within  this 
paradigm  shift,  the  role  of  the  instructor  is  likely  to  change  as  the  learning  process  becomes  more 
soldier-centric. 

Whether  instruction  is  delivered  online  or  in  the  classroom,  questions  arise  as  a  natural 
part  of  the  learning  process.  Students  may  request  an  explanation,  a  clarification,  a  concrete 
example,  or  make  some  other  form  of  inquiry  that  reflects  uncertainty.  Questions  are  usually 
learner-centric,  indicating  an  individual’s  lack  of  specific  knowledge,  an  inability  to  comprehend, 
or  a  need  to  construct  meaning.  Unfortunately,  anytime-anywhere  instruction  may  lack  the 
mechanism  for  students  to  get  a  timely  response  to  a  spontaneous  question.  This  is  particularly 
true  of  around-the-clock  asynchronous  learning  environments  in  which  omnipresent  instructors 
are impractical because of the expense they would entail.

A  fundamental  issue,  then,  concerns  the  impact  of  question  asking  on  learning.  How 
does  the  process  of  question  asking  influence  learning,  performance,  and  student  satisfaction? 
There  is  ample  research  evidence  that  training  students  to  ask  good  questions  can  improve  the 
comprehension  and  learning  of  technical  materials  (Palincsar  &  Brown,  1984;  King,  1992).  How 
can  this  finding  be  beneficially  applied  to  advanced  distributed  learning  environments?  A 
second issue concerns the nature and frequency of the questions that are likely to be asked. To
what  extent  can  these  questions  be  anticipated  beforehand?  What  automated  mechanisms  can  be 
developed  to  answer  them?  How  important  is  the  capability  to  respond  quickly  rather  than  after 
an  unpredictable  waiting  period? 

This  report  focuses  on  the  application  of  question  generation  to  soldier-centric  distributed 
learning  (DL)  environments.  It  is  intended  for  a  broad  audience.  For  planners  and  training 
developers  charting  the  future  course  of  distributed  learning,  it  proposes  application  areas 
appropriate  for  “building  out”  the  research  into  productive  practices.  For  researchers  and 
software  designers  dealing  with  the  challenge  of  incorporating  question  generation  into  advanced 
training systems, it offers a rationale from both a cognitive and information sciences perspective.
The  report  begins  with  a  review  of  the  cognitive  underpinnings  of  question  generation.  This  first 
section  covers  the  cognitive  mechanisms  of  question  generation  as  an  important  learning 
mechanism.  The  primary  focus  is  on  sincere  information-seeking  (SIS)  questions,  as  opposed  to 
questions  that  merely  monitor  the  flow  of  conversation.  The  second  section  covers  empirical 
studies  of  question  generation  in  different  learning  environments  using  various  information 
technologies.  We  present  preliminary  results  of  an  experiment  on  soldiers’  generating  questions 
for  other  soldiers  to  answer,  all  through  a  Website.  The  third  section  discusses  methods  for 
increasing  the  frequency  and  quality  of  questions.  The  fourth  section  presents  some  schemes  to 
identify  and  predict  “frequently  asked  questions”  (FAQ’s)  in  asynchronous  learning 
environments  and  some  methods  to  automate  the  handling  of  these  questions.  In  the  fifth 
section,  we  recommend  a  number  of  application  areas  for  test  and  implementation. 

Question  Generation 

The  design  of  question  asking  and  answering  facilities  in  any  information  system  requires 
an  in-depth  analysis  of  question  generation  mechanisms  (Lauer,  Peacock,  &  Graesser,  1992). 
Such  mechanisms  specify  the  representation  of  the  subject  matter,  the  cognitive  processes 
associated  with  inquiry,  and  the  social  context  of  the  communicative  interaction.  There  are  many 
reasons  why  a  question  facility  can  fail  in  a  learning  environment  or  some  other  information 
system.  This  section  should  help  researchers,  software  designers,  and  course  developers  identify 
some  of  the  potential  barriers  while  appreciating  some  of  the  salient  advantages  of  sophisticated 
question  generation  facilities.  There  will  be  an  emphasis  on  learning  environments  because  the 
long-term  objective  of  this  report  is  to  facilitate  learning  in  Army  training  efforts,  notably 
TADLP  efforts.  However,  this  section  provides  relevant  insights  to  the  design  of  any 
information  system. 

The  Impact  of  Question  Generation  on  Learning 

Researchers  in  cognitive  science  and  education  have  often  advocated  learning 
environments  that  encourage  students  to  generate  questions  (Beck,  McKeown,  Hamilton,  & 
Kucan,  1997;  Dillon,  1988;  Pressley  &  Forrest-Pressley,  1985).  There  are  several  reasons  why 
question  generation  might  play  a  central  role  in  learning.  The  most  frequent  reason  is  that  it 
promotes  active  learning  and  construction  of  knowledge  (Bransford,  Goldman,  &  Vye,  1991; 
Brown,  1988;  Scardamalia  &  Bereiter,  1985;  Papert,  1980).  The  learner  needs  to  actively 
construct  knowledge  during  learning  rather  than  being  bombarded  with  a  large  volume  of 
information.  Therefore,  a  computerized  learning  environment  should  be  a  scaffold  for  active 
construction  of  knowledge  (including  answering  student  questions)  rather  than  being  a  mere 
information  delivery  system  (Edelson,  Gordin,  &  Pea,  1999;  Papert,  1980;  Schank,  1999). 
Constructivist approaches have been so compelling that they have shaped
the standards for curriculum and instruction in the United States during the last decade, e.g.,
Standards  for  the  English  Language  Arts  (NCTE,  1996),  Curriculum  and  Evaluation  Standards 
for  School  Mathematics  (NCTM,  1989),  National  Science  Education  Standards  (NRC,  1996). 
Most  of  these  standards  apply  to  education  (i.e.,  reading,  mathematics,  science)  rather  than  the 
training  of  specific  skills  and  content,  but  there  is  no  principled  reason  for  doubting  the  utility  of 
constructivism  in  training. 




Question-generation learning. Question-generation learning is an environment in which
learners are encouraged or compelled to ask questions while they study material. That is, they
ask questions after trying to comprehend each sentence, paragraph, or section in the text.
Answers to the questions may be provided by an expert immediately (as in synchronous distance
learning)  or  after  a  delay  (as  in  asynchronous  distance  learning).  Alternatively,  there  might  be  no 
satisfactory  answers  to  the  questions.  Question-generation  learning  (QGL)  may  be  effective  for 
several  reasons,  over  and  above  the  answers  that  are  delivered  in  an  ideal  learning  system.  The 
potential  advantages  of  QGL  are  summarized  in  Table  1.  Available  research  has  not  dissected 
which  of  the  particular  components  in  Table  1  are  primarily  responsible  for  the  potential  benefits 
of  QGL. 

Table  1 

Possible  reasons  why  question-generation  learning  may  improve  learning. 

(1)  Active  learning.  Learners  actively  construct  knowledge  in  the  service  of  questions  rather 
than  passively  receiving  information. 

(2)  Metacognition.  Learners  become  sensitive  to  their  own  knowledge  deficits  and 
comprehension  failures  while  they  attempt  to  comprehend  the  material. 

(3) Self-regulated learning. Learners take charge of both identifying and correcting
comprehension  problems. 

(4)  Motivation  and  engagement.  Learners  are  more  motivated  and  engaged  in  the  material 
because  the  learning  experience  is  tailored  to  their  own  needs. 

(5)  Building  common  ground  with  author.  Learners  achieve  more  shared  knowledge  with  the 
author  of  the  material. 

(6)  Transfer  appropriate  processing.  Learners  are  normally  tested  by  answering  questions,  so 
generating  questions  as  part  of  the  learning  process  should  improve  the  overlap  between 
comprehension  representations  and  test  representations. 

(7)  Coding  of  cognitive  representation.  The  cognitive  representations  are  more  precise, 
specific,  and  elaborate  when  learners  generate  questions. 

It  is  well  documented  that  improvements  in  the  comprehension,  learning,  and  memory  of 
technical  material  can  be  achieved  by  training  students  to  ask  questions  during  learning 
(Ciardiello,  1998;  King,  1989,  1992, 1994;  Rosenshine,  Meister,  &  Chapman,  1996).  The 
process  of  question  generation  accounts  for  a  significant  amount  of  these  improvements  from 
QGL,  over  and  above  the  information  supplied  by  answers.  Moreover,  QGL  is  most  effective 
when  students  are  trained  to  ask  good  questions.  Exactly  what  constitutes  a  good  question  will 
be  discussed  later  when  a  taxonomy  of  questions  is  presented.  Rosenshine  et  al.  (1996)  provided 
the  most  comprehensive  analysis  of  the  impact  of  QGL  on  learning  in  their  meta-analysis  of  26 
empirical studies that compared QGL to conditions with appropriate controls. Their meta-analysis
included 17 studies in which students were taught question-asking skills and later tested
on their comprehension or memory. The other nine studies adopted a reciprocal teaching method
in which the teacher spends 75% of the time asking and responding to questions; that is, the
teacher models good question asking and then has the students ask the questions in a
modeling-scaffolding-fading process. The outcome measures in these studies included standardized tests,
short-answer  or  multiple-choice  questions  prepared  by  experimenters,  and  summaries  of  the 
texts. The median effect size was .36 for the standardized tests, .87 for the
experimenter-generated tests, and .85 for the summary tests. There were no significant differences between the
17  studies  that  had  question  asking  instructions  and  the  nine  that  had  the  reciprocal  teaching 
method. 

One  informative  result  of  the  Rosenshine  et  al.  meta-analysis  was  that  the  question  format 
was  important  when  training  the  learners  how  to  ask  questions.  The  analysis  compared  training 
with signal words (who, what, when, where, why, and how), training with generic question stems
(How is X like Y?, Why is X important?, What conclusions can you draw about X?), and training
with main idea prompts (What is the main idea of paragraph X?). The generic question stems
were the best, perhaps because they give the learner more direction, are more concrete, and are
easier  to  teach  and  apply.  This  result  is  informative  because  it  will  give  us  some  guidance  in 
designing  question-prompt  capabilities  for  distributed  learning  applications. 
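As a rough illustration of how generic question stems might drive a question-prompt capability, the sketch below treats each stem as a fill-in template. The three stems are taken from the text above; the function name, the parameter names, and the idea of skipping two-concept stems when only one concept is available are assumptions made for the example, not features of any fielded system.

```python
# Hypothetical sketch: generic question stems as fill-in templates.
# The stems come from the report; all names here are illustrative.
GENERIC_STEMS = [
    "How is {x} like {y}?",
    "Why is {x} important?",
    "What conclusions can you draw about {x}?",
]


def prompt_questions(concept, related=None):
    """Return question prompts for a concept.

    Stems that compare two concepts are skipped when no related
    concept is supplied.
    """
    prompts = []
    for stem in GENERIC_STEMS:
        if "{y}" in stem and related is None:
            continue
        prompts.append(stem.format(x=concept, y=related))
    return prompts


if __name__ == "__main__":
    for question in prompt_questions("a cylinder lock", "a deadbolt"):
        print(question)
```

A courseware designer could call such a function with terms drawn from the current lesson segment, so that the learner sees concrete prompts rather than bare signal words.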

Another  informative  result  pertains  to  the  feedback,  comments,  and  corrections  that 
learners  receive  on  their  questions.  Most  of  the  studies  included  in  the  meta-analysis  probably 
provided  feedback  to  the  learners  on  the  quality  of  their  questions,  but  a  description  of  such 
feedback  was  absent  in  most  of  the  reported  studies.  Therefore,  the  role  of  question-asking 
feedback  is  an  issue  for  future  investigations,  perhaps  in  basic  research.  Feedback  can 
potentially  come  from  a  teacher,  a  peer  learner,  or  a  computer.  MacGregor  (1988),  for  example, 
developed  a  computer-based  instructional  format  that  modeled  good  questions  when  the  learner 
appeared  to  be  facing  problems  in  question  asking. 

Question  Generation  in  Distance  Learning.  The  available  research  on  QGL  has  not 
evolved  to  the  point  of  having  systematic  comparisons  between  synchronous  DL,  asynchronous 
DL,  and  other  learning  environments.  This  comparison  is  presumably  important  because 
learners  receive  an  immediate  answer  to  their  questions  in  synchronous  learning  but  must  wait 
several  hours  or  days  for  an  answer  in  asynchronous  learning.  We  performed  a  literature  search 
for  relevant  articles  on  QGL  that  were  published  during  the  last  five  years  in  the  following 
sources: ERIC, PsycLIT, American Educational Research Journal, American Journal of
Distance Education, Cognition & Instruction, Educational Researcher, Educational Technology
Review, Interactive Learning Environments, Journal of Educational Multimedia and Hypermedia,
Journal  of  Educational  Psychology,  Journal  of  Experimental  Education,  Journal  of  Interactive 
Learning  Research,  Journal  of  the  Learning  Sciences,  Review  of  Educational  Research,  and 
WebNet  Journal.  (We  hereafter  refer  to  this  as  the  “QG  literature  search”).  We  could  not  find  a 
single  study  that  compared  questions  asked  in  synchronous  versus  asynchronous  DL  or  between 
DL  and  other  learning  environments. 

More  global  social  networks  of  users  have  been  participating  in  computer  supported 
collaborative  learning  (Otaga,  Sueda,  Furugori,  &  Yani,  1999;  Songer,  1996)  and  informal  peer- 
help  networks  (Greer,  1999).  Users  are  distributed  at  different  geographical  locations  and 
communicate  by  e-mail.  Learners  ask  and  answer  questions  of  their  peers,  or  communicate  with 
designated  experts.  In  some  systems,  participants  in  the  network  score  points,  or  accumulate 
credits,  by  helping  others  and  answering  questions  posed  by  their  peers.  The  selfish  participants 
who  seek  help,  but  never  provide  help,  end  up  depleting  their  credits  and  eventually  lose  access 
to  the  community  resource.  One  practical  research  issue  to  explore  is  whether  learners  will  seek 
help  more  than  supply  help  in  the  information  economy.  Also,  the  pragmatics  of  e-mail 
communication may be quite different from face-to-face (FTF) communication. In FTF
communication,  questioners  can  routinely  rely  on  an  answer  being  supplied  by  the  listener. 
Indeed,  it  would  be  rude  not  to  respond  to  the  questions.  Another  practical  research  issue  is 
whether  peers  or  experts  are  consulted  most  often  when  a  learner  has  a  question.  A  comfortable 
chat  may  be  higher  in  priority  than  gaining  quality  information. 
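The credit scheme mentioned above for peer-help networks can be sketched as a simple ledger. This is only a toy model under stated assumptions: the starting balance, the cost of asking, and the reward for answering are invented values, and the class and method names are hypothetical rather than drawn from any of the cited systems.

```python
class HelpLedger:
    """Toy credit ledger for a peer-help network (all values are assumptions)."""

    def __init__(self, starting_credits=3, ask_cost=1, answer_reward=1):
        self.starting_credits = starting_credits
        self.ask_cost = ask_cost
        self.answer_reward = answer_reward
        self.balances = {}

    def join(self, user):
        """Register a user with the starting balance (no effect if already joined)."""
        self.balances.setdefault(user, self.starting_credits)

    def has_access(self, user):
        """A user keeps access only while able to pay for one more question."""
        return self.balances.get(user, 0) >= self.ask_cost

    def ask(self, user):
        """Spend credits to post a question; return False if access is lost."""
        if not self.has_access(user):
            return False
        self.balances[user] -= self.ask_cost
        return True

    def answer(self, user):
        """Earn credits by answering a peer's question."""
        self.join(user)
        self.balances[user] += self.answer_reward
```

In this sketch a participant who only asks runs the balance down to zero and is refused, while answering a single peer question restores access. Logging the ratio of `ask` to `answer` calls per user would be one way to examine whether learners seek help more than they supply it.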

The  conclusion  that  QGL  is  effective  is  indisputable,  but  there  are  two  general  concerns 
that  might  limit  the  scope  of  QGL  in  promoting  learning  gains.  The  first  concern  is  that  it  is 
difficult  to  disentangle  (a)  the  process  of  asking  questions,  (b)  feedback  to  the  learner  on  the 
quality  of  the  question,  and  (c)  the  quality  of  the  answer.  The  comparative  impact  of  these  three 
components  on  learning  gains  has  not  been  reported  in  the  available  meta-analyses.  The  second 
concern is that the pragmatic context of question asking has been unnatural or vague in many of
these  studies.  For  example,  consider  the  studies  in  which  the  learner  is  instructed  to  generate 
questions  while  reading.  There  is  no  obvious  recipient  of  the  question,  so  it  is  difficult  to 
determine  whether  a  question  is  a  sincere  information-seeking  (SIS)  question  or  is  a  forced 
exercise in elaborating the material. Consider the studies in which the teacher models
question-asking skills. Most of these questions are not SIS questions because the teacher already knows
the  answers  to  the  questions.  What  is  needed  is  an  authentic  context  in  which  learners  are  vested 
in  the  questions  that  they  ask.  The  next  subsection  on  question-asking  mechanisms  will  address 
the  pragmatics  of  question  asking  and  the  conditions  in  which  bona  fide  questions  are  asked. 

Question  Generation  Mechanisms 

It  could  be  argued  that  any  given  task  that  a  person  performs  can  be  decomposed  into  a 
set  of  questions  that  a  person  asks  and  answers.  For  example,  when  a  soldier  encounters  a  device 
that  malfunctions,  the  relevant  questions  are  “What’s  wrong?”  and  “How  can  it  be  fixed?”  When 
an  officer  reads  a  situation  report,  the  relevant  questions  are  “Why  is  this  important?”  and  “What 
should  I  do  about  it,  if  anything?”.  When  a  young  adult  reads  recruiting  material,  the  relevant 
questions  are  “What’s  interesting?”,  “Do  I  want  to  join?”,  and  “What  are  the  perks?”.  The 
cognitive  mechanisms  that  trigger  question-asking  and  exploration  patterns  need  to  be 
understood in order to optimize learning in the modern workplace, whether it be text, visual
displays,  mechanical  devices,  electronic  equipment,  or  telecommunication  systems. 

There  are  differences  between  SIS  questions  and  questions  that  do  not  invite  answers  that 
the  questioner  particularly  cares  about  (Graesser  &  Person,  1994;  Kreuz  &  Graesser,  1993;  Van 
der Meij, 1987). Van der Meij identified 11 assumptions that need to be in place in order for a
question to count as a SIS question. These 11 assumptions are listed in Table 2. A question is a
misfire (non-SIS question) if one or more of these assumptions are not met. For example, when
a computer science teacher grills a student with a question in a classroom (What is RAM?), it is
not a SIS question because it violates assumptions 1, 5, 8, and 10. When a lawyer
cross-examines a witness with a question, it is not a SIS question because it violates assumptions 1, 3,
4, 5, and 8. A standard lawyer maxim is “never ask a question unless you know the answer.”
Similarly, assumptions in Table 2 get violated when there are rhetorical questions (When does a
person know when he or she is happy?), gripes (When is it going to stop snowing?), greetings
(How are you?), and attempts to redirect the flow of conversation in a group (a hostess asks Bill,
who has been quiet all night, So when is your next vacation, Bill?). In contrast, a question is a
SIS question when a person’s computer is malfunctioning and the person asks a technical
assistant: What’s wrong with my computer?

Table  2 

Assumptions  behind  sincere  information-seeking  questions  (Van  der  Meij,  1987). 


1. The questioner does not know the information he asks for with the question.

2. The question specifies the information sought after.

3. The questioner believes that the presuppositions to the question are true.

4. The questioner believes that an answer exists.

5. The questioner wants to know the answer.

6. The questioner can assess whether a reply constitutes an answer.

7. The questioner poses the question only if the benefits exceed the costs.

8. The questioner believes that the respondent knows the answer.

9. The questioner believes that the respondent will not give the answer in absence of a question.

10. The questioner believes that the respondent will supply the answer.

11. A question solicits a reply.


These  pragmatic  assumptions  have  nontrivial  implications  from  the  standpoint  of  the 
design  of  future  advanced  distributed  learning  systems.  There  are  many  ways  that  a  learning 
system  or  some  other  type  of  information  system  can  break  down.  Learners  will  quickly  give  up 
using  a  system  if  the  learner  is  unable  to  articulate  the  information  in  sufficient  detail  to  get 
useful  answers  (assumption  2),  if  the  system  does  not  correct  misleading  presuppositions 
(assumption  3),  if  the  system  does  not  supply  much  useful  information  in  the  answer  (assumption 
8),  if  the  system  has  trouble  delivering  any  information  (assumptions  10  and  11),  and  if  the 
system cannot recognize silly questions posed by the user (assumptions 1, 4, 5 and 7).
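One way to make the assumptions in Table 2 operational is to treat them as a checklist against which a question is screened. The sketch below is illustrative only: it assumes that some upstream analysis (a human coder or a software component) has already decided which assumptions a question violates, and the function and variable names are hypothetical.

```python
# Paraphrased from Table 2 (Van der Meij, 1987); keys are assumption numbers.
SIS_ASSUMPTIONS = {
    1: "The questioner does not know the information asked for.",
    2: "The question specifies the information sought after.",
    3: "The questioner believes the presuppositions are true.",
    4: "The questioner believes that an answer exists.",
    5: "The questioner wants to know the answer.",
    6: "The questioner can assess whether a reply is an answer.",
    7: "The benefits of asking exceed the costs.",
    8: "The questioner believes the respondent knows the answer.",
    9: "The respondent will not give the answer unasked.",
    10: "The questioner believes the respondent will supply the answer.",
    11: "The question solicits a reply.",
}


def classify_question(violated):
    """Label a question 'SIS' or 'misfire' from the violated assumption numbers.

    Returns the label and the text of each violated assumption.
    """
    violated = sorted(set(violated))
    unknown = [n for n in violated if n not in SIS_ASSUMPTIONS]
    if unknown:
        raise ValueError(f"Unknown assumption numbers: {unknown}")
    label = "misfire" if violated else "SIS"
    return label, [SIS_ASSUMPTIONS[n] for n in violated]


if __name__ == "__main__":
    # The classroom example from the text: "What is RAM?" asked by a teacher.
    print(classify_question([1, 5, 8, 10]))
    # The malfunctioning-computer example violates nothing.
    print(classify_question([]))
```

Recording which assumptions fail, rather than only the overall label, would let a designer see which of the breakdowns described above is occurring most often.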

A  cognitive  computational  model  of  question-asking,  called  PREG,  has  recently  been 
developed  by  Graesser  and  his  colleagues  (Graesser,  Olde,  Pomeroy,  Whitten,  Lu,  &  Craig,  in 
press;  Otero  &  Graesser,  in  press).  According  to  the  PREG  model,  cognitive  disequilibrium 
drives  the  asking  of  SIS  questions  (Collins,  1988;  Festinger,  1957;  Schank,  1999).  That  is, 
questions  are  asked  when  individuals  are  confronted  with  obstacles  to  goals,  anomalous  events, 
contradictions,  discrepancies,  salient  contrasts,  obvious  gaps  in  knowledge,  expectation 
violations,  and  decisions  that  require  discrimination  among  equally  attractive  alternatives.  The 
answers  to  such  questions  are  expected  to  restore  equilibrium  (homeostatic  balance).  Otero  and 
Graesser  (in  press)  developed  a  set  of  production  rules  that  specify  the  categories  of  questions 
that  are  asked  under  particular  conditions  (i.e.,  content  features  of  text  and  knowledge  states  of 
individuals).  Some  example  production  rules  are  shown  below. 
(A) Unknown word. A reader may be ignorant of the meaning of a word.

IF  A  content  word  W  (noun,  main  verb,  or  adjective)  in  the  text  is  not  known 

THEN  Ask:  “What  does  W  mean?” 

(B)  Unknown  referent.  The  explicit  text  mentions  a  noun  or  pronoun  N,  but  it  is  difficult  to 
construct  or  identify  a  referent  in  the  mental  model  that  corresponds  to  N. 

IF  A  referent  of  a  noun  or  pronoun  N  is  not  known 

THEN  Ask:  “What/which  N?” 

(C)  Discrepant  Statement.  An  explicit  statement  S  in  the  text  is  discrepant  with  a  reader’s 
knowledge  of  the  prior  explicit  text  or  with  world  knowledge. 

IF  Statement  S  clashes  with  prior  text  or  with  world  knowledge  &  no  relation  in  the 
explicit  text  is  linked  to  and  accounts  for  S 

THEN  Ask:  “Why  did  S  occur/exist?”,  “How  did  S  occur/exist?”,  or 
“Why  does  the  author  say  S?” 

There  are  dozens  of  these  production  rules,  but  it  is  beyond  the  scope  of  this  report  to  present  the 
full  inventory  of  the  knowledge  structures  and  production  rules.  From  the  present  standpoint, 
these  rules  provide  an  a  priori  theoretical  foundation  for  generating  frequently  asked  questions 
(FAQ)  and  deciding  what  queries  to  include  in  system  design.  More  will  be  said  about  FAQ  in  a 
later  section. 
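The three example rules above can be written as small condition-action functions. The following is a minimal sketch, not the PREG implementation: it assumes the reader's knowledge is available as simple sets of known words and referents, and that whether a statement clashes with prior text or world knowledge has already been judged elsewhere. All class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReaderState:
    """Assumed stand-in for the reader's knowledge state."""
    known_words: set
    known_referents: set = field(default_factory=set)


def rule_unknown_word(word, reader):
    # Rule (A): IF content word W is not known THEN ask what W means.
    if word.lower() not in reader.known_words:
        return f"What does {word} mean?"
    return None


def rule_unknown_referent(noun, reader):
    # Rule (B): IF the referent of N is not known THEN ask what/which N.
    if noun.lower() not in reader.known_referents:
        return f"What/which {noun}?"
    return None


def rule_discrepant_statement(statement, clashes, explained):
    # Rule (C): IF S clashes with prior text or world knowledge and no
    # relation in the explicit text accounts for S THEN ask why/how.
    if clashes and not explained:
        return [
            f"Why did {statement} occur/exist?",
            f"How did {statement} occur/exist?",
            f"Why does the author say {statement}?",
        ]
    return []


def generate_questions(content_words, nouns, statements, reader):
    """Apply rules (A) to (C) and collect the questions they fire.

    `statements` is a list of (text, clashes, explained) tuples.
    """
    questions = []
    for word in content_words:
        question = rule_unknown_word(word, reader)
        if question:
            questions.append(question)
    for noun in nouns:
        question = rule_unknown_referent(noun, reader)
        if question:
            questions.append(question)
    for text, clashes, explained in statements:
        questions.extend(rule_discrepant_statement(text, clashes, explained))
    return questions
```

Run over course text in advance with an estimated learner profile, rules of this kind could produce candidate entries for a list of anticipated questions.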

The  ability  to  detect  cognitive  disequilibrium  at  appropriate  points  while  reading  is  an 
excellent  index  of  whether  a  person  understands  technical  text  (Baker,  1985;  Burbules  &  Linn, 
1988;  Glenberg,  Wilkinson,  &  Epstein,  1982;  Kintsch,  1998;  Otero  &  Campanario,  1990).  For 
example,  if  there  is  a  direct  contradiction  in  the  text,  this  should  be  noticed.  Poor  comprehenders 
gloss  over  such  contradictions  and  sustain  an  “illusion  of  comprehension.”  A  deep 
comprehender  actively  seeks  possible  contradictions,  clashes  with  world  knowledge,  and  gaps  in 
background  knowledge  (Beck  et  al.,  1997;  Hacker,  Dunlosky,  &  Graesser,  1998).  The  detection 
of  cognitive  disequilibrium  can  be  manifested  in  several  ways.  Reading  time  slows  down.  Eye 
movements  regress  to  previous  sections  of  text,  or  between  contradictory  constituents.  And  of 
course, one obvious manifestation is that the learner asks questions (Graesser & McMahen,
1993; Otero & Graesser, in press).

Detecting  Disequilibrium.  The  detection  of  cognitive  disequilibrium  is  alone  not 
sufficient  for  question  generation.  According  to  the  research  conducted  by  Graesser  and 
McMahen  (1993),  the  potential  question  asker  must  pass  two  additional  hurdles  after  the 
detection  of  disequilibrium:  articulation  of  the  question  in  words  (called  verbal  coding)  and  the 
courage  to  express  the  question  in  a  social  setting  (called  social  editing).  Thus,  three  stages  need 
to  be  intact  for  a  question  to  be  produced  (disequilibrium  detection,  verbal  coding,  and  social 
editing).  Graesser  and  McMahen  investigated  this  by  having  college  students  read  different 
versions  of  stories  and  mathematical  word  problems:  contradictions  between  text  statements, 
deletion  of  critical  information,  insertion  of  irrelevant  information,  and  control  (no  anomalies). 
The  likelihood  of  generating  questions  was  higher  in  the  anomaly  conditions  than  in  the  control 
condition  when  subjects  were  instructed  to  generate  questions,  as  would  be  predicted  by  the 
PREG  model.  It  is  informative  to  note  that  the  likelihood  that  students  asked  questions  was 
extremely  low  (.04)  when  they  were  not  instructed  to  ask  questions,  but  were  merely  permitted  to 
ask questions of an experimenter in an adjacent room. Thus, questions do not surface when it is
a physical or social effort to ask them. It is important to minimize these barriers in learning and
information systems.

In  a  project  funded  by  the  Office  of  Naval  Research,  Graesser  and  his  colleagues 
investigated  the  questions  that  college  students  ask  when  an  everyday  device  malfunctions 
(Graesser, Olde, Pomeroy, Whitten, Lu, & Craig, in press; Graesser, Olde, & Lu, 2001). After
reading about a device (e.g., cylinder lock, dishwasher, electric bell, toaster), the participants
received scenarios in which the device breaks down (e.g., the key turns but the bolt doesn’t move,
in  the  context  of  a  cylinder  lock)  and  they  generated  questions  about  the  malfunction.  The 
breakdown  scenario  was  expected  to  put  the  students  in  cognitive  disequilibrium.  Deep 
comprehenders  are  expected  to  explore  the  faults  of  the  breakdown,  to  ask  good  questions,  and 
thereby  to  restore  cognitive  equilibrium.  Eye-tracking  data  were  also  collected  during  question 
asking  in  one  of  the  empirical  studies.  There  are  two  straightforward  predictions  of  the  PREG 
model.  First,  those  participants  who  have  a  deep  understanding  of  the  device  should  ask  good 
questions  that  converge  on  faults.  Second,  the  eye  movements  of  deep  comprehenders  should 
converge  on  likely  faults  that  explain  the  breakdown.  Both  of  these  predictions  were  supported 
in  the  empirical  investigations.  It  is  interesting  to  note  that  deep  comprehenders  do  not 
necessarily  ask  more  questions.  They  ask  better  questions  but  not  a  higher  frequency  of 
questions  (Fishbein,  Eckart,  Lauer,  van  Leeuwen,  &  Langmeyer,  1990;  Graesser  et  al.,  in  press). 

One  practical  implication  of  the  PREG  model  is  that  it  is  necessary  to  put  the  learners  in 
cognitive  disequilibrium  before  they  start  to  ask  questions  in  which  they  have  a  vested  interest. 
Otherwise,  they  settle  for  a  shallow  understanding  that  maintains  an  illusion  of  comprehension. 
Therefore,  the  learning  environment  should  present  challenges,  obstacles,  contradictions, 
puzzles,  tough  decisions,  and  other  events  that  crack  the  barrier  of  shallow  knowledge.  One  of 
the  sobering  facts  about  education,  training,  and  comprehension  is  that  most  students  are  not 
sensitive  to  their  own  comprehension  deficits.  Research  on  comprehension  calibration  has 
revealed  that  there  is  a  low  correlation  (.1  to  .4)  between  a  learner’s  perception  of  how  well  they 
comprehend  technical  text,  as  measured  by  ratings,  and  how  well  they  actually  comprehend  the 
text,  as  measured  by  objective  tests  (Glenberg  &  Epstein,  1985;  Maki,  1998).  Students  learn  the 
limits  of  their  knowledge  after  they  work  on  challenging  problems  and  apply  didactic  knowledge 
in  practical  arenas.  These  are  precisely  the  learning  contexts  that  create  cognitive  disequilibrium 
and  stimulate  questions. 

Knowledge  Representations  and  Cognitive  Processes 

The  contrast  between  shallow  and  deep  knowledge  is  fundamental  (Bloom,  1956; 
Bransford  et  al.,  1991),  so  we  will  define  it  in  this  subsection.  Shallow  knowledge  consists  of 
explicitly  mentioned  ideas  in  a  text  that  refer  to:  lists  of  concepts,  a  handful  of  simple  facts  or 
properties  of  each  concept,  simple  definitions  of  key  terms,  and  major  steps  in  a  procedure  (not 
the  detailed  steps).  Deep  knowledge  consists  of  coherent  explanations  of  the  material  that  fortify 
the  learner  for  generating  inferences,  solving  problems,  making  decisions,  integrating  ideas, 
synthesizing  new  ideas,  decomposing  ideas  into  subparts,  forecasting  future  occurrences  in  a 
system,  and  applying  knowledge  to  practical  situations.  Deep  knowledge  is  presumably  needed 
to  articulate  and  manipulate  symbols,  formal  expressions,  and  quantities,  although  some 
individuals  can  master  these  skills  after  extensive  practice  without  deep  mastery.  Deep 
knowledge is essential for handling challenges and obstacles because there is a need to
understand how mechanisms work and to generate and implement novel plans. Explanations are
central  to  deep  knowledge,  whether  the  explanations  consist  of  logical  justifications,  causal 
networks,  or  goal-plan-action  hierarchies.  It  is  well  documented  that  the  construction  of  coherent 
explanations  is  a  robust  predictor  of  an  adult’s  ability  to  learn  technical  material  from  written 
texts  (Chi,  deLeeuw,  Chiu,  &  LaVancher,  1994;  VanLehn,  Jones,  &  Chi,  1992;  Webb,  Troper,  & 
Fall,  1995). 

Researchers  in  cognitive  science,  artificial  intelligence,  and  discourse  processes  have 
specified  different  types  and  levels  of  knowledge  representation  in  rich  detail  (Graesser,  Gordon, 
&  Brainerd,  1992;  Kintsch,  1998;  Lehmann,  1992).  This  subsection  succinctly  points  out  those 
distinctions  that  have  a  salient  bearing  on  question  generation  in  DL  and  that  clarify  the 
distinction  between  deep  and  shallow  learning. 

Text  and  Pictures.  The  representations  of  texts  and  pictures  can  be  segregated  into  the 
levels  of  surface  code,  explicit  propositions,  mental  models,  and  pragmatic  interaction.  The  most 
shallow  level  is  the  surface  code,  which  preserves  the  exact  wording  and  syntax  of  the  explicit 
verbal  material.  When  considering  the  visual  modality,  it  preserves  the  low-level  lines,  angles, 
sizes,  shapes,  and  textures  of  the  pictures.  The  explicit  proposition  representation  captures  the 
meaning  of  the  explicit  text  and  the  pictures.  A  proposition  contains  a  predicate  (main  verb, 
adjective,  connective)  that  interrelates  one  or  more  arguments  (noun-referents,  embedded 
propositions).  Examples  of  propositions  are  the  soldier  repaired  the  computer 
[repair(soldier,computer)], and, in the acronymic parlance of the Army Battle Command
System¹, the AMDWS operator can pass messages to other ATCCS within the TOC [can pass
(AMDWS  operator  (within  TOC(other  ATCCS)))].  At  the  deepest  level,  there  is  the  mental 
model  of  what  the  text  is  about.  For  everyday  devices,  this  would  include:  the  components  of  the 
electronic  or  mechanical  system,  the  spatial  arrangement  of  components,  the  causal  chain  of 
events  when  the  system  operates,  the  mechanisms  that  explain  each  causal  step,  the  functions  of 
the  device  and  device  components,  and  the  plans  of  agents  who  manipulate  the  system  for 
various  purposes.  Finally,  there  is  the  pragmatic  communication  level  that  specifies  the  main 
messages  that  the  author  is  trying  to  convey  to  the  reader  (or  the  narrator  to  the  audience). 
Examples  of  purposes  of  reading  are  to  explain  how  to  repair  equipment,  to  advertise  a  product, 
or  to  protect  someone  from  a  hazardous  condition. 
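
The predicate-argument structure of a proposition can be illustrated with a small data structure. The following is a minimal sketch, not a representation used in the research cited above; the class name `Proposition` and its `render` method are assumptions introduced here for illustration. It encodes the two bracketed examples from the preceding paragraph, with embedded propositions allowed as arguments.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# An argument is either a noun-referent (a plain string) or an embedded proposition.
Argument = Union[str, "Proposition"]


@dataclass(frozen=True)
class Proposition:
    """Hypothetical container: one predicate interrelating one or more arguments."""
    predicate: str
    arguments: Tuple[Argument, ...]

    def render(self) -> str:
        """Write the proposition in predicate(argument,...) bracket notation."""
        parts = [a.render() if isinstance(a, Proposition) else a for a in self.arguments]
        return f"{self.predicate}({','.join(parts)})"


# "The soldier repaired the computer."
repair = Proposition("repair", ("soldier", "computer"))

# The nesting mirrors the bracketed AMDWS example in the text.
can_pass = Proposition(
    "can pass",
    (Proposition("AMDWS operator", (Proposition("within TOC", ("other ATCCS",)),)),),
)

print(repair.render())
print(can_pass.render())
```

The sketch captures only the explicit proposition level; the surface code, mental model, and pragmatic levels would each need a separate representation.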

The  types  of  representations  are  theoretically  different  from  the  levels.  From  the  present 
standpoint,  there  are  several  types  of  knowledge  representation  affiliated  with  the  explicit 
propositions and mental models. Table 3 lists types of knowledge representation
that are important in military contexts. Specific question categories are
associated with particular types of knowledge representation. Each of these types of knowledge
becomes progressively deeper to the extent that it is more fine-grained (i.e., the grain size has
high resolution) and has more complex interconnections among subcomponents (i.e., there are
more  relational  links  and  more  links  that  deviate  from  a  strict  hierarchy). 


¹ AMDWS refers to Air Missile Defense Workstation; ATCCS refers to Army Tactical Command and
Control System; TOC refers to Tactical Operations Center.

Table 3

Important types of knowledge representation for the military.

Agents. Organized sets of troops, units, organizations, countries, complex software units, etc.
Examples are organizational charts, friend-foe networks, and client-server networks.

Class inclusion. One concept is a subtype or subclass of another concept.
For example, an M1A2 is-a tank is-a weapon system.

Spatial layout. Spatial relations among regions and entities in regions.
For example, Eagle Base is-in Bosnia is-in Eastern Europe. Bosnia is-south-of Germany.

Compositional structures. Components have subparts and subcomponents.
For example, a computer has-as-parts a monitor, keyboard, CPU, and memory.

Procedures and plans. A sequence of steps/actions in a procedure accomplishes a goal.
Examples are performance steps in disassembling and removing a mine, and the sequence
of actions to create a command post filter in the Common Tactical Picture application.

Causal chains and networks. An event is caused by a sequence of events and enabling states.
Examples are the stages in firing an artillery round and the chain of events in a battle.

Others. Property descriptions, quantitative specifications, rules, mental states of agents.
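
The class-inclusion and spatial-layout examples in Table 3 are chains of links from a concept to its immediate parent. As a rough sketch, assuming each concept has at most one parent under a given relation (an assumption made here for simplicity, not a claim from Table 3), the chains can be stored as dictionaries and followed upward. The function name `ancestors` is hypothetical.

```python
# Each relation maps a concept to its immediate parent (examples taken from Table 3).
IS_A = {"M1A2": "tank", "tank": "weapon system"}
IS_IN = {"Eagle Base": "Bosnia", "Bosnia": "Eastern Europe"}


def ancestors(concept, relation):
    """Follow parent links upward and return every concept reached, nearest first."""
    chain = []
    seen = {concept}
    while concept in relation:
        concept = relation[concept]
        if concept in seen:  # guard against accidental cycles in the data
            break
        seen.add(concept)
        chain.append(concept)
    return chain


print(ancestors("M1A2", IS_A))
print(ancestors("Eagle Base", IS_IN))
```

A strict parent-per-concept structure is the shallowest case; as the text notes, knowledge becomes deeper when links deviate from a strict hierarchy, which would call for a general graph rather than a dictionary of single parents.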

Cognitive  processes  also  vary  in  difficulty.  Table  4  lists  the  major  types  of  cognitive 
processes  that  are  relevant  to  an  analysis  of  questions  (Standards  of  the  Army  Training  Support 
Center,  2000;  Bloom,  1956).  According  to  Bloom’s  taxonomy  of  cognitive  objectives,  the 
cognitive processes with higher numbers are more difficult and require greater depth.
Recognition and recall are the easiest, comprehension is intermediate, and categories 4-7 are the
most difficult. It is debatable whether there are differences in difficulty among categories 4-7,
so they are often collapsed into one category in applications of Bloom’s taxonomy.

Table  4 

Types  of  cognitive  processes  that  are  relevant  to  questions. 

(1)  Recognition.  The  process  of  verbatim  identification  of  specific  content  (e.g.,  terms,  facts, 
rules,  methods,  principles,  procedures,  objects)  that  was  explicitly  mentioned  in  the  text. 

(2)  Recall.  The  process  of  actively  retrieving  from  memory  and  producing  content  that  was 
explicitly  mentioned  in  the  text. 

(3)  Comprehension.  Demonstrating  understanding  of  the  text  at  the  mental  model  level  by 
generating  inferences,  interpreting,  paraphrasing,  translating,  explaining,  or  summarizing. 

(4)  Application.  The  process  of  applying  knowledge  extracted  from  text  to  a  problem,  situation, 
or  case  (fictitious  or  real-world)  that  was  not  explicitly  mentioned  in  the  text. 

(5)  Analysis.  The  process  of  decomposing  elements  and  linking  relationships  between  elements. 

(6) Synthesis. The process of assembling new patterns and structures, such as constructing a
novel  solution  to  a  problem  or  composing  a  novel  message  to  an  audience. 

(7)  Evaluation.  The  process  of  judging  the  value  or  effectiveness  of  a  process,  procedure,  or 
entity,  according  to  some  criteria  and  standards. 
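
The grouping of the seven processes into three difficulty bands, with categories 4-7 collapsed, can be written as a simple lookup. This is an illustrative sketch only; the band labels and the function name `difficulty_band` are assumptions chosen here to match the wording of the text.

```python
# The seven cognitive processes of Table 4, keyed by category number.
BLOOM_PROCESSES = {
    1: "recognition", 2: "recall", 3: "comprehension", 4: "application",
    5: "analysis", 6: "synthesis", 7: "evaluation",
}


def difficulty_band(category: int) -> str:
    """Map a Table 4 category number to one of three difficulty bands."""
    if category not in BLOOM_PROCESSES:
        raise ValueError(f"unknown category: {category}")
    if category <= 2:
        return "easiest"
    if category == 3:
        return "intermediate"
    return "most difficult"  # categories 4-7 are collapsed into one band


bands = {name: difficulty_band(number) for number, name in BLOOM_PROCESSES.items()}
print(bands)
```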

Question  Taxonomies 

Schemes  for  analyzing  the  qualitative  characteristics  of  questions  have  been  proposed  by 
researchers in education (Beck et al., 1997; Ciardiello, 1998; Dillon, 1988; Flammer, 1981),
psychology (Dillon, Golding, & Graesser, 1988; Graesser & Person, 1994), and computational
linguistics and artificial intelligence (Allen, 1983, 1995; Jurafsky & Martin, 2000; Lehnert, 1997;
Schank, 1986; Webber, 1988). Rather than reviewing all of these schemes, we will adopt the
taxonomy proposed by Graesser and Person (1994). The Graesser-Person taxonomy is
grounded theoretically in cognitive science and has been successfully applied to a large number
of question corpora (such as human and computer tutoring, questions asked while using a new
computer system, questions asked while comprehending text, questions raised in television news,
and questions in letters to an editor). The taxonomy is presented in Appendix A. Later in this
report  we  will  apply  the  Graesser-Person  taxonomy  to  a  corpus  of  questions  asked  by  Army 
officers  in  a  Captains  Career  Course. 

The  18  categories  are  defined  according  to  the  content  of  the  information  sought  rather 
than on question signal words (who, what, why, how, etc.). The question categories can be
recognized  by  particular  generic  question  frames  (which  are  comparatively  distinctive),  but  not 
simply  by  signal  words  (which  can  be  ambiguous).  As  discussed  earlier,  generic  question  frames 
are  easier  to  teach  to  learners  and  produce  more  learning  gains  than  do  signal  words  (Rosenshine 
et  al.,  1996;  King  &  Rosenshine,  1993).  Appendix  A  includes  a  question  category  label  and  the 
generic question frame for each category. Concrete examples of these 18 categories can be found
in  previous  publications  (Graesser,  Lang,  &  Morgan,  1988;  Graesser,  Person,  &  Huber,  1992). 

Graesser  and  Person  (1994)  reported  that  some  of  the  categories  of  SIS  questions  are 
associated  with  deep  comprehension  and  cognitive  processes.  Specifically,  they  regarded 
question categories 10 through 15 as deep-reasoning questions because they tapped causal
networks, planning, and logical justifications (i.e., answers to why, how, what-if, and
what-if-not). They tested this by analyzing students’ SIS questions in tutoring sessions on research
methods and basic mathematics. SIS questions constituted 28% of the student questions.
Trained judges classified student questions into the 18 categories with a high degree of reliability
(Cronbach’s alpha of .81 or higher). The proportion of SIS questions that were in question
categories 10-15 was recorded. A separate set of trained judges classified questions on depth
using five levels that map onto Bloom’s taxonomy; they could do this with high reliability
(Cronbach’s alpha = .85). A depth score was determined for each student, which consisted of
the proportion of SIS questions that were in the deeper levels of Bloom’s taxonomy. There was a
.64 correlation between depth score and the proportion of student questions that were
deep-reasoning questions. Most of the student questions were shallow (70%) rather than deep (30%).
The most frequent deep-reasoning questions were in the instrumental/procedural category (66%
of the deep-reasoning questions). One important result of the study was that, overall, students
rarely ask deep-reasoning questions (.30 deep questions x .28 SIS questions = .08
deep-reasoning questions, or 8%). This finding is compatible with one major conclusion in the present
report:  Getting  learners  to  ask  good  questions  is  an  uphill  challenge. 
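
The 8% figure follows from multiplying the two proportions reported above. The short calculation below is a sketch under the reading that the 70%/30% split applies to SIS questions only, which is what the product in the text implies; the variable names are introduced here for illustration.

```python
sis_share = 0.28          # proportion of all student questions that were SIS questions
deep_share_of_sis = 0.30  # proportion of SIS questions that were deep-reasoning questions

# Proportion of all student questions in each group.
deep_overall = sis_share * deep_share_of_sis
shallow_sis_overall = sis_share * (1 - deep_share_of_sis)
non_sis_overall = 1 - sis_share

print(f"deep-reasoning SIS questions: {deep_overall:.3f}")
print(f"shallow SIS questions:        {shallow_sis_overall:.3f}")
print(f"non-SIS questions:            {non_sis_overall:.3f}")
```

The unrounded product is .084, which the text reports as .08, or 8%.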

The majority of student questions (72%) are not SIS questions. Instead, they are
questions that simply verify what the student already knows (Aren’t I supposed to add these columns?,
Doesn’t a factorial design have two independent variables?) or that address the social interaction
between student and tutor (Is it my turn?, What did you say?, When does this session end?). We
call these common ground questions and metacommunication questions, respectively. We
expect that these common ground and metacommunication questions will emerge frequently in
synchronous DL, particularly when the learner cannot see the instructor. In contrast, we expect
them  to  be  less  frequent  in  asynchronous  DL  systems,  because  there  is  a  cost  in  having  a  delayed 
answer  and  because  the  benefits  of  such  communications  are  not  adequate  to  justify  an 
unpredictable  wait  (see  assumption  7  in  Table  2).  SIS  questions  should  be  more  prevalent  in 
asynchronous  than  synchronous  DL  since  time  to  reflect  and  compose  a  question  is  not 
constrained. 

One  can  imagine  a  3-dimensional  grid  that  captures  the  landscape  of  questions  that  could 
potentially be asked about any given subject matter. The three dimensions are the type of
knowledge (see Table 3), the type of cognitive process (see Table 4), and the question category
(see Table 5). Given the number of categories in each dimension, there are 7 x 7 x 18 = 882 cells
in  the  landscape  of  questions.  Some  of  these  cells  are  more  prevalent  than  others  in  any  given 
subject  matter.  For  example,  consider  procedures,  which  are  step-by-step  recipes  performed  by 
people.  The  shallow  questions  include  concept  completion,  definition,  feature  specification,  and 
quantification. Deep questions include causal antecedents (Why did X occur?, How did X
occur?, What caused X?), causal consequences (What-if?, What-if-not?, What are the
consequences of X?), goal orientation questions (Why do you do X?), instrumental/procedural
(How do you do X?), and expectational (Why don’t you do X?). One can ask what the next step
is in the procedure (What happens after action X is performed?), what action to take if a
condition exists (What do you do if X exists/occurs?), what the relative ordering of actions is (X
before/after/during  Y  vs.  indeterminate),  and  what  the  functions  of  actions  are  (Why  is  it 
important  to  do  step  X?). 
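
The landscape of questions can be enumerated directly as the cross product of the three dimensions. The sketch below uses the type names from Table 3 and the process names from Table 4; the 18 question category names are not reproduced in this section, so placeholder labels stand in for them (an assumption made purely for illustration).

```python
from itertools import product

KNOWLEDGE_TYPES = [
    "agents", "class inclusion", "spatial layout", "compositional structures",
    "procedures and plans", "causal chains and networks", "others",
]
COGNITIVE_PROCESSES = [
    "recognition", "recall", "comprehension", "application",
    "analysis", "synthesis", "evaluation",
]
# Placeholder labels only: the actual 18 category names are not listed in this section.
QUESTION_CATEGORIES = [f"category {n}" for n in range(1, 19)]

# Every cell is one (knowledge type, cognitive process, question category) triple.
cells = list(product(KNOWLEDGE_TYPES, COGNITIVE_PROCESSES, QUESTION_CATEGORIES))
print(len(cells))

# Cells relevant to one type of knowledge, e.g. procedures.
procedure_cells = [cell for cell in cells if cell[0] == "procedures and plans"]
print(len(procedure_cells))
```

A course designer could attach a weight or a count of existing test items to each cell to see which regions of the landscape a course currently covers.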

Learning objectives vary, so a particular course will emphasize some cells
more  than  others.  Shallow  knowledge  is  sufficient  in  certain  educational  and  military  training 
contexts,  although  it  should  be  recognized  that  shallow  knowledge  does  not  necessarily  transfer 
well  to  real-world  practical  situations.  For  example,  shallow  learning  is  prevalent  when  many 
people  learn  about  the  components  of  a  computer,  the  jargon,  and  the  relevant  acronyms.  But 
this shallow knowledge will not help much when the person needs to use the computer in a
real-world task or to repair the computer when it malfunctions. There are occasions when deep
knowledge  is  necessary.  Such  knowledge  involves  a  different  distribution  across  the  cells  in  the 
landscape  of  questions.  In  any  event,  we  believe  this  landscape  of  questions  will  be  useful  to 
designers  of  tests  and  distributed  learning  environments  because  it  will  expand  the  horizons  on 
what  questions  are  possible.  The  vast  majority  of  test  questions  are  shallow  in  current 
educational  and  training  practices  (Martinez,  1999).  Our  hope  is  that  the  depth  and  scope  of  the 
questions  will  broaden  with  advanced  research  and  application  that  ingrains  question  generation 
as  a  learning  multiplier. 

Multiple  Choice  Question  Formats 

Questions  can  appear  in  many  different  formats  on  tests.  The  most  popular  formats  are 
multiple  choice,  true-false,  matching,  short  answer  completion,  and  essay.  Multiple  choice  (MC) 
is  the  most  frequent  format  and  is  routinely  adopted  in  tests  constructed  by  Educational  Testing 
Service  (such  as  the  Scholastic  Aptitude  Test),  the  information  technology  training  enterprise 
(such  as  the  certification  exams  for  Microsoft,  Cisco,  and  Novell),  and  the  military  (such  as  the 
Armed Services Vocational Aptitude Battery). Essay questions are more useful than MC
questions in diagnosing cognitive states and tracing cognitive processes (Martinez, 1999), but
properly constructed MC items can do an excellent job of tapping complex thought and deep
knowledge. There is a rich psychometric tradition in the construction of multiple choice items
on such tests (American Educational Research Association, American Psychological
Association, & National Council on Measurement in Education, 1985; Department of Defense,
1983; Downing & Haladyna, 1997; Messeck, 1987). This subsection discusses how to construct
a good MC question because this can serve training and courseware developers as well as
instructors.

For  starters,  it  is  important  to  distinguish  between  format  and  content.  The  format  refers 
to  the  composition  of  the  question  elements,  methods  of  circumventing  comprehension  difficulty, 
and  methods  to  minimize  “giving  away  the  answer”  to  sophisticated  guessers.  The  content  refers 
to  the  level  and  type  of  knowledge  tapped  (see  Table  3)  and  to  cognitive  processes  (see  Table  4). 
The  following  terminology  is  adopted  when  describing  question  format: 

(1)  Context.  Information  that  sets  the  stage  and  precedes  the  question  stem. 

(2)  Stem.  The  focal  question,  without  the  answer  options. 

(3)  Key.  The  correct  answer  among  the  set  of  alternatives. 

(4)  Distracters.  The  incorrect  options  among  the  set  of  alternatives. 

In most applications, there are four or five answer options and only one correct answer. It is
recommended that the distracters vary in distractibility. One distracter should be the
near  miss.  This  is  the  most  seductive  distracter  that  reflects  a  common  misconception  that 
people  have.  The  discrimination  between  the  key  and  the  near  miss  should  reflect  an  important 
learning  objective  or  pedagogical  point  rather  than  being  arbitrarily  subtle  or  merely  tapping 
frivolous detail. The thematic distracter has content that is related to the topic at hand, but is not
correct.  A  learner  who  quickly  scanned  the  learning  materials  would  have  trouble  discriminating 
the  key,  the  near  miss,  and  the  thematic  distracter.  The  unrelated  distracter  would  seem 
reasonable  to  someone  who  never  read  the  material,  but  is  in  fact  unrelated  to  the  lesson  content. 
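
The format terminology above (context, stem, key, distracters) and the recommendation that distracters vary in distractibility lend themselves to a simple record with a format check. The following is a hedged sketch rather than an Army or TRADOC specification; the class names `MCItem` and `Distracter`, the `problems` method, and the sample item are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# The three kinds of distracter described in the text.
DISTRACTER_KINDS = ("near miss", "thematic", "unrelated")


@dataclass
class Distracter:
    text: str
    kind: str  # expected to be one of DISTRACTER_KINDS


@dataclass
class MCItem:
    stem: str                      # the focal question, without the answer options
    key: str                       # the single correct answer
    distracters: List[Distracter]  # the incorrect options
    context: str = ""              # optional stage-setting information before the stem

    def problems(self) -> List[str]:
        """Return a list of format problems; an empty list means none were found."""
        found = []
        n_options = 1 + len(self.distracters)
        if n_options not in (4, 5):
            found.append(f"expected 4 or 5 options, got {n_options}")
        if any(d.kind not in DISTRACTER_KINDS for d in self.distracters):
            found.append("unknown distracter kind")
        if not any(d.kind == "near miss" for d in self.distracters):
            found.append("no near-miss distracter")
        texts = [self.key] + [d.text for d in self.distracters]
        if len(set(texts)) != len(texts):
            found.append("duplicate answer options")
        return found


# Hypothetical sample item built from the terminology defined above.
item = MCItem(
    context="A test item consists of a context, a stem, a key, and distracters.",
    stem="Which element is the correct answer among the set of alternatives?",
    key="The key",
    distracters=[
        Distracter("The stem", "near miss"),
        Distracter("The context", "thematic"),
        Distracter("The footnote", "unrelated"),
    ],
)
print(item.problems())
```

A check of this kind covers format only. Whether the discrimination between the key and the near miss reflects an important learning objective is a content judgment that still requires a human reviewer.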

Army  Guidance.  Mr.  James  Dees  and  Mr.  Edward  Tyler  of  the  Army  Training  and 
Doctrine  Command  (TRADOC)  provided  guidelines  to  the  Army  Research  Institute  for 
constructing  any  test  question  and  some  guidelines  for  constructing  MC  test  questions  in 
particular.  These  guidelines  are  presented  in  Appendix  B.  It  should  be  apparent  that  there  are 
many  constraints  that  need  to  be  taken  into  consideration  when  writing  a  good  MC  question.  The 
process  is  difficult  and  time-consuming  when  done  correctly. 

The  difficulty  of  constructing  MC  questions  perhaps  explains  one  outcome  of  our 
literature  search  on  QGL.  We  were  interested  in  whether  there  was  any  empirical  research  that 
tested  whether  the  process  of  learners  generating  MC  questions  might  improve  their 
comprehension  of  learning  material.  This  approach  is  being  pursued  by  the  Army  in  conjunction 
with  Athenium,  LLC  in  the  Army  TEAMThink  project,  which  will  be  discussed  later.  Our 
literature  search  came  up  empty,  which  suggests  that  a  project  that  pursues  this  approach  is 
innovative.  It  is  also  likely  to  produce  learning  gains,  given  the  well-documented  effects  of  QGL 
that  were  discussed  earlier.  However,  one  challenge  in  implementing  such  a  project  will  be 
teaching learners how to ask MC questions. Scaffolding will be needed to guide the learner in
constructing questions that follow particular formats, constraints, and content features. Possible
methods of scaffolding will be described later in this report.


Empirical  Studies  of  Question  Generation 

There  is  an  idealistic  vision  that  learners  are  vigorous  question  generators  who  actively 
self-regulate their learning. They identify their own knowledge deficits, ask questions that focus
on these deficits, and answer the questions by exploring reliable information sources.
Unfortunately, this idealistic vision is an illusion. The sobering truth is that the vast majority of
learners have trouble identifying their own knowledge deficits (Hacker et al., 1998) and ask very
few questions (Dillon, 1988; Good, Slavings, Harel, & Emerson, 1987; Graesser & Person,
1994). Graesser and Person’s (1994) estimate from available studies indicated that the typical
student  asks  .1  question  per  hour  in  a  classroom  and  that  the  poverty  of  classroom  questions  is  a 
general  phenomenon  across  cultures.  The  fact  that  it  takes  about  nine  hours  for  a  typical  student 
to  ask  one  question  in  a  classroom  is  perhaps  not  surprising  because  it  would  be  impossible  for  a 
teacher to accommodate 25-30 curious students continually. The rate of question asking is higher
in other learning environments. An average student asks 26.5 questions per hour in one-on-one
human  tutoring  sessions  (Graesser  &  Person,  1994).  Thus,  when  there  is  an  attentive  question 
answerer,  the  rate  of  student  question  asking  goes  up  250-fold. 

The  upper  bound  in  the  number  of  student  questions  per  hour  is  135,  at  least  according  to 
Graesser,  Langston,  and  Bagget  (1993).  The  latter  estimate  is  based  on  college  students 
interacting  with  Point  &  Query  educational  software.  The  only  way  that  these  students  can  learn 
about  a  topic  (e.g.,  woodwind  instruments)  is  by  asking  questions  and  by  reading  answers  to  their 
questions.  To  ask  a  question,  the  learner  first  points  to  a  word,  phrase,  picture  element,  or  other 
hotspot  on  the  computer  screen  in  a  hypertext  system.  Second,  a  menu  of  relevant  questions 
about the hotspot is presented to the learner (e.g., What does X mean?, What does X look like?).
Third,  the  learner  selects  one  of  the  question  options.  Fourth,  the  answer  is  presented  and  the 
learner  reads  the  answer.  This  cycle  continues  until  the  learning  session  is  finished.  It  is 
extremely  easy  for  a  learner  to  ask  a  question;  it  simply  involves  two  clicks  of  a  mouse,  one  to 
select the hotspot and the other to select the question. The Point & Query software is similar
to some other menu-based question asking systems (Sebrechts & Swartz, 1991). In this ideal
learning environment, the student asks over 1,000 times as many questions as in a classroom.
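
The four-step point-and-query cycle described above can be sketched as two lookups, one per mouse click. This is not the Point & Query implementation; the content store, the function names, and the placeholder answer strings are assumptions made here to illustrate the interaction pattern.

```python
# Hypothetical content store: hotspot -> {question: answer}.
# The answer strings are placeholders, not actual lesson content.
CONTENT = {
    "oboe": {
        "What does oboe mean?": "[definition of oboe]",
        "What does oboe look like?": "[description or picture of oboe]",
    },
}


def question_menu(hotspot):
    """Steps 1-2: the learner points to a hotspot and receives its menu of questions."""
    return list(CONTENT.get(hotspot, {}))


def select_question(hotspot, question):
    """Steps 3-4: the learner selects a question and the stored answer is presented."""
    return CONTENT[hotspot][question]


# One pass through the cycle: first click on the hotspot, second click on a question.
menu = question_menu("oboe")
answer = select_question("oboe", menu[0])
print(menu)
print(answer)
```

Because every question on the menu is authored in advance, the designer controls both the cost of asking (two clicks) and the depth of the questions that learners are able to ask.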

Depth  of  Questions 

One other striking result is that most student questions are shallow, as we reported earlier.
Student questions are normally shallow, short-answer questions that recycle through the content
and  interpretation  of  explicit  material;  they  rarely  tap  the  deeper  levels  of  knowledge 
representation  and  cognitive  processes.  What  the  students  do  mirrors  what  the  teachers  do  in 
classrooms.  Only  about  4%  of  teacher  questions  are  deep  questions  (Kerry,  1987;  Dillon,  1988). 
The  computer  has  the  potential  to  improve  the  quality  in  addition  to  the  quantity  of  student 
questions.  The  quality  of  student  questions  might  improve  in  Point  &  Query  software  that 
presents  only  good  questions  for  the  students  to  model.  Graesser  et  al.  (1993)  found  that  5  times 
as many deep-reasoning questions were asked when there were deep question options on the
Point & Query question menu and the students were given a task that required deep reasoning.
Craig, Gholson, Ventura, and Graesser (2001) presented college students with conversational
computer agents that modeled good question asking behavior. That is, there were talking heads
that  asked  questions  with  synthesized  speech.  A  control  condition  had  the  talking  heads  deliver 
the  same  content  with  monologues.  There  were  over  twice  as  many  deep-reasoning  questions  in 
the  condition  that  had  the  agents  model  question  asking  than  in  the  control  condition.  Recall  for 
the  content  was  also  significantly  higher  in  the  modeling  condition.  These  findings  confirm  the 
conclusions  presented  earlier  that  learning  gains  can  be  realized  by  training  students  how  to  ask 
questions. 

Group  question  asking  may  provide  an  environment  that  enhances  learning  and  question 
asking.  For  example,  Adafe  (1998)  had  groups  of  students  generate  questions  that  would  appear 
on  an  examination.  It  was  necessary  to  give  some  guidance  to  the  students  on  how  to  construct 
test  questions  and  how  to  conduct  the  group  interactions.  Each  group  contributed  1  or  more 
questions  on  an  examination.  The  participants  were  children  and  the  topic  was  mathematics.  It 
appeared  that  this  method  improved  learning  because  the  percentage  of  grades  of  C  or  better 
increased from 71% to 89% in a group of 30 students. Unfortunately, the evaluation was not
systematic,  so  it  is  an  open  question  whether  group  question  asking  is  effective  in  promoting 
learning  gains.  It  is  difficult  to  determine  whether  any  advantages  could  be  attributed  to  question 
generation  per  se  or  to  group  learning.  Researchers  have  periodically  reported  that  group 
learning has several advantages over individual learning (Slavin, 1995; Johnson & Johnson,
1989; Kagan, 1994). Springer, Stanne, and Donovan (1999) conducted a meta-analysis of 37
studies  in  postsecondary  education,  in  the  areas  of  science,  mathematics,  and  engineering.  The 
effect  sizes  for  small  group  learning  were  .51  for  learning  achievement,  .47  for  persistence  in  the 
course,  and  .50  for  attitude  toward  the  learning  experience,  compared  with  individual  learning. 
The  impact  on  learning  achievement  was  greater  for  instructor-made  exams  (.59)  than 
standardized  instruments  (.33). 

The low frequency and quality of learner questions have the potential to improve in
synchronous DL. An instructor can immediately answer student questions in synchronous DL,
whereas this is impractical in a classroom. However, empirical data on question asking in DL are
conspicuously absent in the literature.

Collaborative  Question  Generation 

The  Army  Research  Institute  (ARI)  and  TRADOC  are  developing  a  specialized  version 
of  TEAMThink,  a  novel  question-based,  collaborative  learning  application.  TEAMThink,  a 
licensed  product  of  Athenium  LLC,  stimulates  students  to  actively  engage  in  the  creation  of 
questions through small group collaboration. The questions are posed to other students through a
Website.  The  collaborative  question  asking  is  predicted  to  increase  learning  gains  on  the  subject 
matter.  In  accordance  with  a  Technology  Transfer  Memorandum  of  Agreement  between  ARI 
and  TRADOC,  the  benefits  of  TEAMThink  are  being  tested  in  three  Army  schools. 

As  a  baseline  measure,  students  in  the  Captains  Career  Course  at  the  Engineer  School, 
Fort  Leonard  Wood,  Missouri,  participated  in  writing  questions  in  a  four-item  multiple  choice 
format.  Students  composed  one  or  two  questions  on  either  of  two  topics:  the  military  decision 
making process and the scheme of engineer operations, each covered recently in the course. The
questions  were  composed  offline  without  collaboration.  Of  the  173  questions  generated,  only 
2% were considered not doctrinally compliant and 5% were considered trivial. Through the
Athenium Website, a subset of 34 questions was posed to 108 students (excluding those from
Allied Nations). The overall performance was 68% correct. When the 173 questions were categorized according to
the Graesser and Person taxonomy (Appendix A), 45% were feature specification, 34% were
instrumental/procedural,  and  8%  were  concept  completion.  Eight  other  categories  had  less  than 
5%  of  the  questions.  Seven  categories,  primarily  those  tapping  deeper  knowledge,  were  never 
used.  This  baseline  will  later  be  compared  to  questions  generated  through  Web-enabled,  small 
group  collaboration. 

In  summary,  available  empirical  research  supports  a  number  of  conclusions  about 
question  generation  and  learning.  QGL  is  effective  in  promoting  learning  gains.  However, 
learners need to be trained to ask questions because they ask very few questions in most learning
environments  and  most  of  the  questions  are  low  in  quality  (i.e.,  shallow  rather  than  deep).  There 
is  no  systematic  research  on  collaborative  question  generation  in  groups  and  the  resulting  effect 
on  learning.  There  is  no  systematic  research  on  the  impact  of  generating  questions  during 
comprehension  on  learning  gains.  MC  questions  have  many  constraints  in  format  and  content,  so 
they  are  too  difficult  for  students  to  generate  without  scaffolding  of  instruction.  Research  on 
question  asking  in  DL  is  conspicuously  absent. 

Now  that  we  have  covered  the  theoretical  and  empirical  research  on  question  asking  and 
learning,  the  focus  will  shift  to  the  role  of  questions  in  future  DL  systems.  The  remaining 
sections  should  be  regarded  as  “informed  speculation.”  It  is  informed  because  of  the  existing 
body  of  research  reported  in  the  first  two  sections.  It  is  speculative,  however,  because  research 
on  questions  is  conspicuously  absent  in  the  current  DL  environments.  Existing  empirical  data 
are  so  sparse  and  fragmented  that  we  will  have  to  rely  on  basic  research  and  theory  to  support  our 
speculations.  The  fact  that  systematic  research  on  QGL  in  DL  environments  is  so  sparse  is  not 
surprising  given  that  there  have  been  few  evaluations  of  DL  environments  with  adequate  controls 
and control groups (Wisher, Champagne, Pawluk, Eaton, Thornton, & Curnow, 1999).


Practices  for  Increasing  the  Frequency  and  Quality  of  Questions 

Learners  clearly  need  some  scaffolding  to  assist  them  in  asking  questions.  Progressively 
more  scaffolding  will  be  needed  to  encourage  (a)  any  question  at  all,  (b)  deep  questions,  and  (c) 
multiple  choice  questions.  Deep  knowledge  and  format  constraints  require  additional  help.  In 
this  section,  we  propose  methods  for  increasing  the  frequency  and  quality  of  questions.  The 
methods  are  in  no  particular  order  of  importance. 

Practice  1:  Clarify  the  learning  objectives  and  test  criteria 

The  learner  can  evaluate  what  questions  are  relevant  if  they  know  these  objectives  and 
criteria.  Without  this  clarity,  there  are  barriers  in  the  question  generation  stages  of  identifying 
knowledge  deficits  and  social  editing.  Some  learning  objectives  require  only  shallow  and 
procedural  knowledge,  such  as  assembling  a  new  computer  and  logging  onto  a  network.  The  test 
is  whether  computers  get  assembled  and  the  learner  successfully  logs  onto  the  system.  Other 
learning  objectives  require  deep  knowledge  about  causal  mechanisms,  as  in  the  case  of 
diagnosing  and  repairing  equipment.  Designers  of  learning  environments  should  provide 
example test questions that appropriately map onto the relevant cells in the matrix of questions
appropriate to a training program.

Practice  2:  Present  challenges  that  create  cognitive  disequilibrium 

SIS  questions  are  triggered  by  cognitive  disequilibrium,  as  discussed  in  an  earlier  section. 
The  learner  needs  to  be  presented  challenges  in  the  form  of  obstacles  to  goals,  anomalous  events, 
contradictions,  discrepancies,  obvious  gaps  in  knowledge,  and  decisions  that  require 
discrimination  among  equally  attractive  alternatives.  Such  challenges  are  most  engaging  when 
they  are  not  too  easy  or  too  difficult,  but  are  at  the  learner’s  zone  of  proximal  development 
(Rogoff,  1990;  Vygotsky,  1978).  This  can  be  accomplished  by  having  the  DL  system  track  the 
learner’s  mastery  of  the  material  and  present  challenges  that  are  tailored  to  the  learner’s  profile 
(within  the  “zone”).  The  challenges  consist  of  example  problems  to  solve,  procedural  tasks  to 
execute,  and  breakdown  scenarios  to  fix. 
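
As an illustration only, the tracking-and-tailoring idea can be sketched in a few lines of Python. The sketch assumes that the DL system already holds a mastery estimate for the learner and a difficulty rating for each challenge, both on a hypothetical 0-to-1 scale. The names Challenge and select_challenges and the band limits are assumptions introduced here, not features of any fielded system.

```python
from dataclasses import dataclass


@dataclass
class Challenge:
    name: str
    difficulty: float  # hypothetical scale: 0.0 (trivial) to 1.0 (hardest)


def select_challenges(mastery, challenges, low=0.05, high=0.25):
    """Return challenges that sit slightly above the learner's mastery.

    A challenge is "within the zone" when its difficulty exceeds the
    mastery estimate by at least `low` and at most `high`. Both limits
    are illustrative assumptions, not empirically derived values.
    """
    in_zone = [c for c in challenges if low <= c.difficulty - mastery <= high]
    # Present the easier in-zone challenges first.
    return sorted(in_zone, key=lambda c: c.difficulty)
```

A learner with an estimated mastery of 0.5 would, under these assumed limits, be offered challenges rated between 0.55 and 0.75 and none that are easier or much harder.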

Practice  3:  Give  feedback  on  particular  comprehension  deficits 

Most  learners  do  not  have  good  comprehension  calibration  skills  so  they  do  not  notice  the 
knowledge  deficits  that  would  otherwise  drive  questions.  Feedback  on  particular  knowledge 
deficits  ends  up  penetrating  the  “illusion  of  comprehension”  and  promoting  deeper 
comprehension  and  also  deeper  questions.  Intelligent  tutoring  systems  (ITS)  are  capable  of 
recognizing  bugs,  misconceptions,  and  knowledge  deficits  at  a  fine-grained  level  (Lesgold  et  al., 
1992;  Graesser,  VanLehn,  Rose,  Jordan,  &  Harter,  in  press),  so  the  ITS  technology  would  be  one 
source  of  implementing  this  method  in  TADLP. 

Practice  4:  Present  examples  of  good  questions 

The  learner  can  acquire  good  question  asking  skills  by  modeling  good  question  askers. 
Example  good  questions,  in  whatever  format,  can  be  available  in  relevant  cells  in  the  landscape 
of  questions.  The  learner  can  observe  these  question  items  by  pointing  to  different  hotspots  on 
the  graphical  user  interface  (GUI)  or  in  a  help  facility.  Alternatively,  pairs  of  avatars  can  exhibit 
good  questions  in  virtual  dialogues.  These  modeling  approaches  have  proved  effective  in 
improving  the  quantity  and  quality  of  questions. 

Practice  5:  Present  generic  question  frames 

Generic question frames (e.g., What does X mean?, How do you do X?, see Table 5)
guide the user in selecting and articulating questions. They are pitched at a somewhat more
abstract level than actual questions, but there is a finite number of frames that are acquired by the
learner. The Point & Query interface adopts these generic question frames and substantially
improves  the  quantity  and  quality  of  learner  questions.  The  learner  discovers  categories  of  good 
questions  by  examining  the  options  on  the  question  menu.  The  software  designer  can  tailor  the 
question  options  to  fulfill  the  learning  objectives  and  the  subject  matter  constraints. 
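
A question menu built from generic frames might be sketched as follows. This is not the Point & Query implementation. The two frames are the examples quoted above, while the category labels, the dictionary layout, and the function name are assumptions made for illustration.

```python
# Generic frames keyed by a hypothetical category label.
QUESTION_FRAMES = {
    "definition": "What does {x} mean?",
    "procedure": "How do you do {x}?",
}


def question_menu(topic):
    """Fill every generic frame with the topic the learner pointed at."""
    return [frame.format(x=topic) for frame in QUESTION_FRAMES.values()]
```

A designer could tailor the menu to a course by adding or removing entries in the frame table without changing the function.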

Practice  6:  Use  conversational  agents  to  scaffold  the  construction  of  questions 

Researchers  have  developed  computer-generated,  animated,  talking  heads  that  have  facial 
features  synchronized  with  synthesized  speech  and  appropriate  gestures  (Cassell  &  Thorisson, 
1999;  Cohen  &  Massaro,  1994;  Johnson,  Rickel,  &  Lester,  2000).  These  conversational  agents 
have  been  used  as  navigators  on  web  pages,  as  narrators,  as  avatars,  and  as  tutors  in  intelligent 
tutoring  systems.  For  example,  AutoTutor  (Graesser,  Wiemer-Hastings,  Wiemer-Hastings, 
Kreuz, & TRG, 1999) is a conversational agent that teaches students about introductory
computer literacy by holding a conversation. Dialog moves in AutoTutor include backchannel
feedback (Uh-huh, Okay), pumps (Tell me more), prompts (primary memory includes ROM and
_____), assertions, hints, corrections, summaries, questions, and a variety of other speech acts. A
turn-by-turn dialog scaffolds the learner to articulate information. Steve (Rickel & Johnson,
1999)  is  an  avatar  in  virtual  reality  that  shows  the  learner  how  to  operate  equipment,  answers 
learner  questions,  and  offers  suggestions. 

Conversational  agents  could  be  developed  to  guide  learners  in  articulating  questions  in 
various  formats,  including  MC  questions.  For  example,  the  agent  could  follow  the  following 
script  that  guides  the  user  in  generating  a  MC  question: 

Step 1: Select a question from a cell in the landscape of questions.

Step 2: Show an example question in a selected cell.

Step  3:  Generate  a  question  stem. 

Step  4:  Generate  the  key. 

Step  5:  Generate  a  near  miss  distracter. 

Step  6:  Generate  a  thematic  distracter. 

Step  7:  Generate  a  remote  distracter. 

Each  step  would  be  augmented  by  a  conversational  finite  state  transition  network  (Graesser, 
Person,  Harter,  &  TRG,  2000;  Jurafsky  &  Martin,  2000)  that  allows  learners  to  ask  clarification 
questions (e.g., What is a stem?), that answers these questions, that makes suggestions (At this
point, you need to generate a stem), that gives feedback (uh-huh, that’s right, okay), and that
gives  hints  (Is  there  an  important  misconception  that  motivates  this  near  miss?).  A 
conversational  agent  for  question  generation  is  well  within  the  realm  of  current  technologies. 
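
A minimal sketch of such a script, written in Python, is given here. It is a deliberately simplified stand-in for a finite state transition network: each of the seven steps is one state, a learner utterance ending in a question mark is treated as a clarification question and answered from a small glossary without leaving the state, and any other utterance is accepted as the learner's contribution and advances the script. The class name, the glossary entries, and the question-mark heuristic are assumptions made for illustration.

```python
STEPS = [
    "select a question from a cell in the landscape of questions",
    "look at an example question in the selected cell",
    "generate a question stem",
    "generate the key",
    "generate a near miss distracter",
    "generate a thematic distracter",
    "generate a remote distracter",
]

# Hypothetical canned answers to clarification questions.
GLOSSARY = {
    "stem": "The stem is the part of the item that poses the question.",
    "key": "The key is the correct answer option.",
}


class QuestionScriptAgent:
    """Walks a learner through the seven steps, one state per step."""

    def __init__(self):
        self.state = 0        # index into STEPS
        self.responses = []   # what the learner produced at each step

    def done(self):
        return self.state >= len(STEPS)

    def prompt(self):
        return f"At this point, you need to {STEPS[self.state]}."

    def handle(self, utterance):
        text = utterance.strip()
        if self.done():
            return "The question is complete."
        if text.endswith("?"):
            # Clarification question: answer it and stay in the same state.
            for term, definition in GLOSSARY.items():
                if term in text.lower():
                    return definition
            return "I can't answer that yet. " + self.prompt()
        # Any other utterance counts as the learner's work for this step.
        self.responses.append(text)
        self.state += 1
        if self.done():
            return "Okay. The question is complete."
        return "Okay. " + self.prompt()
```

A fuller network would add hint and correction transitions at each state. The sketch covers only clarification, suggestion, and feedback.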

Practice  7:  Have  groups  generate  questions  collaboratively 

In  the  TEAMThink  project  on  group  question  generation,  one  student  generates  a  MC 
question  and  key,  whereas  partners  in  a  small  group  critique  the  question  as  well  as  the  proposed 
distracters.  Later  on,  the  generated  questions  are  used  on  tests  with  other  small  groups. 
Alternative  role  assignments  are  currently  being  explored  in  the  project.  For  example,  a  third 
student  might  revise  the  question,  in  light  of  the  feedback  from  the  partner.  So,  there  potentially 
could  be  a  question  asker,  a  critiquer,  and  a  reviser.  There  also  could  be  feedback  from  an  expert 
composer  before  a  final  question  is  revised.  One  possible  augmentation  would  be  to  have  a 
competitive  game  in  which  groups  compete  in  their  generation  of  questions.  Another 
performance-based  assessment  would  be  to  have  other  learners,  in  other  groups,  answer  the 
question  and  evaluate  the  question  on  discriminative  validity  (high  performing  students 
answering  it  correctly,  low  performing  students  missing  it).  Learners  could  receive  feedback  on 
the  question  quality  according  to  psychometric  indices. 
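
One psychometric index of this kind is the upper-lower discrimination index, which is the proportion of high scorers answering the item correctly minus the proportion of low scorers doing so. The sketch below is illustrative. The function name is hypothetical, and the default of comparing the top and bottom 27% of test takers is a common convention assumed here rather than a value taken from the TEAMThink project.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper-lower discrimination index for one question.

    total_scores: overall test score of each learner.
    item_correct: 1 if that learner answered this question correctly, else 0.
    fraction: share of learners placed in each extreme group (assumed 27%).
    Returns a value from -1.0 to 1.0; higher means better discrimination.
    """
    n = len(total_scores)
    if n == 0 or n != len(item_correct):
        raise ValueError("inputs must be non-empty and of equal length")
    k = max(1, round(n * fraction))
    # Learner indices ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    p_high = sum(item_correct[i] for i in high_group) / k
    p_low = sum(item_correct[i] for i in low_group) / k
    return p_high - p_low
```

A value near zero would indicate that the question does not separate high from low performers, which is feedback the question's authors could use when revising it.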

There  are  other  role  assignments  that  might  produce  more  questions,  better  questions, 
and/or  more  learning.  One  learner  could  generate  the  question  stem,  a  second  the  key,  a  third  the 
near  miss,  a  fourth  the  thematic  distracter,  and  a  fifth  the  remote  distracter.  Each  learner  would 
also  justify  what  is  produced  from  the  standpoint  of  content  or  question  quality.  This  approach 
might sharpen important distinctions and yield deeper learning.

Practice 8: Concretize the author

The  author  of  a  document  or  lesson  is  normally  invisible.  Many  readers  assume  that 
printed  text  is  indisputable  truth,  rather  than  it  being  the  best  possible  account  that  an  author 
could  prepare  at  a  particular  point  in  time.  Beck  et  al.  (1997)  have  designed  a  Questioning  the 
Author  method  that  trains  students  to  imagine  an  author  in  flesh  and  blood  and  to  question  what 
the author says. Why does the author make a particular claim? What evidence is there? How
does  one  idea  in  a  text  relate  to  another  idea?  Why  would  the  author  express  something  in  a 
particular  way?  This  Questioning  the  Author  method  improves  questions,  comprehension,  and 
metacomprehension.  So  a  picture  of  an  author  and  example  questions  that  challenge  what  the 
author  says  might  be  incorporated  in  DL  applications. 

Practice  9:  Use  computer-generated  critiques  of  questions 

Advances in computational linguistics, corpus linguistics, and natural language processing
in  artificial  intelligence  have  made  it  feasible  to  design  computer  facilities  that  critique  questions 
on  quality.  Some  of  these  advances  will  be  discussed  in  the  next  section.  Questions  can  be 
automatically  classified  according  to  the  Graesser  and  Person  (1994)  question  categories  (see 
Table  5).  Once  classified,  they  can  be  evaluated  on  depth  in  the  critique. 

Answer  options  in  MC  questions  can  be  evaluated  on  correctness  using  latent  semantic 
analysis  (LSA)  (Landauer,  Foltz,  &  Laham,  1998;  Foltz,  Gilliam,  &  Kendall,  2000),  a  technique 
that  will  be  discussed  further  in  the  next  section.  Landauer  and  Dumais  (1997)  created  an  LSA 
representation  with  300  dimensions  from  4.6  million  words  that  appeared  in  30,473  articles  in 
Grolier’s Academic American Encyclopedia. They submitted to the LSA representation the
synonym  portion  of  the  TOEFL  test,  a  test  developed  by  the  Educational  Testing  Service  to 
assess  how  well  non-native  English  speakers  have  mastered  the  words  in  the  English  language. 
The  test  has  a  four-alternative,  forced  choice  format,  so  there  is  a  25%  chance  of  answering  the 
questions  correctly.  The  LSA  model  selected  the  alternative  that  had  the  highest  match  with  a 
comparison  word.  The  LSA  model  answered  64.4%  of  the  questions  correctly,  which  is 
essentially  equivalent  to  the  64.5%  performance  for  college  students  from  non-English  speaking 
countries.  There  was  also  a  significant  correlation  between  the  relative  likelihood  that  the  3 
distracters  were  selected  and  the  likelihood  that  humans  found  them  to  be  seductive  lures.  In 
another study, Foltz et al. (2000) created a 300-dimensional space that captured a text on
introductory psychology and then submitted the MC questions used in the
class to the LSA representation. The LSA space received a grade of C-. One could imagine having learners get
feedback  on  their  own  MC  questions  by  inspecting  the  likelihood  that  LSA  selects  each 
alternative.  If  the  key  had  roughly  the  same  LSA  score  as  the  near  miss,  then  that  would  be  a 
desirable  feature,  as  long  as  there  is  a  principled  reason  that  the  near  miss  is  incorrect. 
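
The selection step can be sketched in Python under the assumption that an LSA space has already been built and is available as a mapping from each word or answer option to its vector. The three-dimensional toy vectors below are invented purely for illustration (a real space would have on the order of 300 dimensions), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_options(target, options, space):
    """Rank answer options by their similarity to the target, best first."""
    scores = {opt: cosine(space[target], space[opt]) for opt in options}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Invented toy vectors standing in for a trained LSA space.
TOY_SPACE = {
    "physician": [0.9, 0.1, 0.0],
    "doctor": [0.8, 0.2, 0.1],   # intended key
    "nurse": [0.5, 0.5, 0.2],    # intended near miss
    "river": [0.1, 0.9, 0.1],
    "chalk": [0.0, 0.1, 0.9],
}

ranking = rank_options("physician", ["chalk", "nurse", "doctor", "river"], TOY_SPACE)
```

The model's choice is the first entry in the ranking. Comparing the scores of the key and the near miss gives the learner a rough indication of how seductive the near miss is.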

Application  Areas 

This  section  identifies  several  application  areas  for  question  generation  as  a  learning 
multiplier  in  distributed  learning  environments.  We  will  focus  on  distributed  learning 
environments  that  are  potentially  useful  to  the  military,  including  the  reuse  of  sharable  content 
through application of the Sharable Content Object Reference Model (SCORM).

Communication Channels and Media. One important consideration is the
communication  medium  between  questioner  and  answerer.  In  face-to-face  (FTF)  interaction, 
there  is  both  temporal  and  spatial  contiguity,  so  the  speech  participants  co-experience  the 
communication  and  the  conversational  setting.  It  is  easy  for  the  listener  to  give  backchannel 
feedback  (speech,  head  nods,  gestures,  facial  expressions)  while  the  speaker  talks.  Such 
feedback  serves  to  repair  the  speaker’s  conversation,  to  fill  in  the  speaker’s  words,  to  speak 
simultaneously,  and  to  perform  other  collaborative  acts  that  are  well  known  in  the  speech  and 
discourse  literature  (Clark,  1996;  Fox,  1993).  In  FTF  conversation,  speakers  routinely  ask 
counter-clarification questions (Which disk are you talking about?) and produce metacognitive
utterances (I see. That makes sense. Do you follow? Okay?). These questions solidify common
ground between speech participants, a grounding that is facilitated further by a shared physical
environment. In FTF interaction, it is not normally necessary to produce metacommunication
utterances (Could you repeat that? Do you hear me? I’ll say a few things now) because there are
virtually no barriers in the communication channel. In FTF interaction, there are comparatively
few words per turn because it is easy to negotiate turn-taking through intonation, facial
expressions, gestures, tight temporal coupling, and floor maintenance utterances (umm, uhh,
soo...).


Alternative  media  present  barriers  that  are  absent  in  FTF  interaction.  Asynchronous  DL 
is  at  the  opposite  end  of  the  spectrum  because  the  interaction  is  neither  temporally  nor  spatially 
contiguous. Hillman (1999) has explored some of the differences in discourse patterns between
these two media. In that analysis of computer classes, teachers spoke 73% of the sentences in
FTF  interaction  but  only  49%  in  asynchronous  DL.  This  strongly  suggests  that  students  can  be 
more  active  constructors  of  knowledge  in  asynchronous  DL.  Organizing  information  and  the 
lesson  is  more  explicit  in  DL  than  FTF  because  of  the  lack  of  nonverbal  channels  and  the  lack  of 
immediacy  of  the  interaction.  However,  Hillman  did  not  analyze  the  discourse  in  the  level  of 
detail  that  is  routinely  conducted  by  researchers  in  the  field  of  discourse  processing.  The 
discourse  acts  that  should  be  infrequent  in  asynchronous  DL  are  backchannel  feedback, 
conversational  repair,  metacognitive  utterances,  and  metacommunicative  utterances. 

Responses  to  such  acts  have  comparatively  low  information  value,  whereas  there  is  a  high 
cost  in  the  form  of  wait  time  and  miscommunication  (because  of  the  absence  of  shared  context); 
since  the  benefits  do  not  exceed  the  costs,  these  questions  would  not  be  asked  (see  assumption  7 
in Table 2). The number of words per turn should be much higher in asynchronous DL because
of  the  need  to  create  context  and  to  optimize  the  information  load  per  turn.  Perhaps  this  problem 
could  be  minimized  with  static  pictures  of  the  instructor  in  addition  to  the  materials,  an  expanded 
form  of  audiographics.  This  might  work  for  person-to-person  communication,  but  not  when  the 
learner  is  interacting  with  a  generic  trainer,  and  a  trainer  is  interacting  with  a  generic  learner.  It 
is  interesting  to  point  out  that  deep  SIS  questions  are  predicted  to  be  more  prevalent  in 
asynchronous  DL  because  there  should  be  a  high  load  of  information  per  turn,  there  is  more  time 
to  compose  questions,  and  there  is  more  time  to  reflect  and  plan  during  question  composition. 
There  should  be  fewer  questions,  but  higher  quality  questions  in  asynchronous  DL  than  FTF. 

The  media  in  between  FTF  and  asynchronous  DL  present  interesting  research  issues  from 
the  standpoint  of  question  generation  and  discourse.  Wisher  et  al.  (1999)  identified  the  various 
communication media and reported their usage and effectiveness in recent DL applications.
Clark’s (1996) analysis of discourse and communication media provides an excellent theoretical
guide  for  analyzing  the  different  forms  of  synchronous  DL.  Speech  participants  need  to  monitor 
common  ground  while  they  communicate,  so  there  needs  to  be  feedback  among  them  on  what 
others  think  about  what  gets  said.  There  are  different  degrees  of  feedback  when  speaker  #1 
expresses  message  M  to  speaker  #2  (see  also  Graesser  et  al.,  1999): 

(1)  I  am  near  you 

(2)  I  am  listening 

(3)  I  hear  you 

(4)  I  acknowledge  that  you  are  talking 

(5)  I  understand  what  you  say 

(6)  I  understand  what  you  say,  and  I’ll  demonstrate  it  by  expanding  upon  it. 

(7)  I  agree  with  what  you  say 

(8)  I  have  an  attitude  toward  what  you  say 

In FTF interaction, utterance number 1 can be accomplished by spatial proximity, number 2 by
the listener facing the speaker, number 3 by facial expressions that respond to the other’s speech,
number 4 by any form of backchannel feedback, number 5 by head nodding or positive spoken
utterances (yeah, I see), number 6 by informative assertions that are relevant to the topic, number
7  by  explicit  agreement  or  more  dramatic  forms  than  in  number  5,  and  number  8  by  emotional 
expressions,  sarcasm,  or  humor. 

These  levels  can  be  quickly  accomplished  in  FTF  interaction,  but  must  be  more  overtly 
and explicitly expressed in other communication media (Vrasidas & McIsaac, 1999). For
example,  consider  two-party  telephone  conversations.  Number  1  is  impossible  but  the  others  are 
routinely  accomplished  by  the  speech,  pauses,  and  intonation  patterns,  but  not  by  body 
positioning,  facial  expressions,  gestures,  and  interacting  with  the  setting.  In  conference  calls 
with  several  people,  there  is  uncertainty  about  who  is  present,  so  it  is  polite  for  everyone  to 
introduce  themselves.  It  is  more  difficult  for  everyone  to  reconstruct  who  the  primary  speakers 
are. There need to be more metacommunicative utterances (Can you hear me? Who is
speaking? Can I say something?) that manage the conversation and handle 2, 3, and 4. There
need to be more metacognitive utterances (I’m not following. Could you say that in English?
Could I think about that for awhile?). This is all accomplished with speech, pauses, and
intonation  contours.  The  features  of  the  telephone  artifact  also  play  a  role,  of  course.  If  there  is  a 
dial  tone  in  the  middle  of  the  conference  call,  then  that  disables  2-8.  A  dial  tone  after  a  heated 
exchange  satisfies  8.  The  conversational  flow  and  understanding  can  be  disrupted  with  poor 
audio  quality,  such  as  delays  for  transmission,  delayed  auditory  feedback,  and  signal  degradation 
that  changes  intonation  parameters. 

Adding  the  visual  modality  to  the  audio  modality  in  synchronous  DL  is  expected  to  make 
it  easier  for  speech  participants  to  communicate,  even  though  such  improvements  do  not 
necessarily improve learning (Wisher & Curnow, 1999). The review of DL by Wisher et al.
(1999)  reported  that  57%  of  the  DL  research  conducted  in  training  environments  was  video 
teletraining  with  two-way  video  and  two-way  audio.  The  video  channel  opens  up  facial 
expressions,  gestures,  body  positioning,  and  interactions  with  the  setting.  Consequently,  we 
would expect fewer metacommunication and metacognitive questions when the video media are
added. One of the challenges has been to coordinate the timing between the speech and visual
media.  Many  individuals  get  irritated  when  facial  expressions,  gestures,  and  mouth  movements 
are  out  of  synch  with  the  speech.  It  is  conceivable  that  deep  SIS  questions  are  more  prevalent 
without  the  visual  modality.  Speakers  do  not  have  to  sustain  polite  body  postures  and  facial 
expressions when there are no visual media, so there are more cognitive resources available for
deeper  thinking.  Research  in  neurolinguistics  has  revealed  that  when  people  are  deep  in  some 
forms  of  thought,  they  try  to  remove  visual  stimuli  by  closing  their  eyes  or  looking  up  at  the 
ceiling. 


In  the  future,  there  needs  to  be  more  research  that  empirically  investigates  question 
generation  while  adults  use  these  different  media  and  communication  channels.  This  section  has 
identified  some  theoretical  reasons  for  expecting  differences,  but  the  empirical  research  has  yet 
to  be  conducted. 

Reducing  the  barriers  between  questions  and  answers.  An  earlier  subsection  of  this 
report  discussed  methods  of  minimizing  barriers  to  question  generation  and  maximizing  the 
quantity  and  quality  of  questions.  The  present  subsection  addresses  barriers  between  a  question 
that gets asked and useful answers. A frequent complaint about contemporary information
retrieval  systems  is  that  the  performance  of  these  systems  is  poor.  It  is  difficult  to  pose  questions 
that  end  up  supplying  useful  information.  Recently,  at  a  text  retrieval  conference  on  question 
answering, sixteen systems were entered in a competition. A set of questions was asked, the
query-retrieval systems attempted to fetch the answers from the web, and then the top five
answers were examined for each question. A recall score was defined as the likelihood that the
correct  answer  was  among  the  top  five  answers.  Recall  scores  were  moderately  impressive 
(around  50%)  for  shallow  questions  with  short  answers  (concept  completion,  quantification),  but 
notoriously poor for deep questions that required long answers (why, how).
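
The recall score defined above can be computed as in the following sketch. The function name and the dictionary layout of its inputs are assumptions made for illustration and do not describe the scoring software used at the conference.

```python
def recall_at_k(ranked_answers, correct_answers, k=5):
    """Share of questions whose correct answer is among the top k returned.

    ranked_answers: question id -> list of answers, best first.
    correct_answers: question id -> the correct answer.
    """
    if not correct_answers:
        raise ValueError("at least one question is required")
    hits = 0
    for question_id, correct in correct_answers.items():
        # A question with no returned answers counts as a miss.
        if correct in ranked_answers.get(question_id, [])[:k]:
            hits += 1
    return hits / len(correct_answers)
```

Computing the score separately for shallow and deep question categories would expose the kind of gap described above.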

Quite  clearly,  learners  will  give  up  asking  questions  if  the  system  fails  to  deliver  useful 
information. If assumptions number 8, 10, or 11 in Table 2 are violated, then the learner will
quickly  give  up  using  the  tool.  Instead,  they  will  ask  a  human  (who  may  deliver  faulty 
information)  or  will  give  up  asking  any  question  at  all.  This  was  a  painful  lesson  in  our  early 
assessments  of  AutoTutor  (Graesser  et  al.,  1999;  Graesser,  VanLehn,  et  al.,  in  press),  the  tutoring 
system  that  we  developed  to  teach  college  students  computer  literacy  by  holding  a  conversation 
with  them.  AutoTutor  could  handle  definitional  questions  and  verification  questions,  but  not  any 
of the other categories in Table 5. When AutoTutor couldn’t handle a question, it answered
That’s a good question, but I can’t answer it. This had to happen only once or twice before a
student  stopped  asking  any  SIS  questions  altogether.  Similarly,  in  the  classroom,  student 
questions  may  not  be  understood  by  the  teacher,  so  the  utility  of  the  answers  may  be  marginal; 
this  may  partially  explain  why  question  asking  in  some  children  quickly  extinguishes  shortly 
after  they  enter  school.  Users  quickly  evaluate  whether  an  information  system  is  delivering 
useful  information,  whether  the  information  system  is  a  web  facility,  a  textbook,  or  a  human 
expert.  There  needs  to  be  an  extremely  high  benefit-to-cost  ratio  for  adults  to  stay  motivated 
asking questions. One direction for the future is to substantially increase this ratio.

Reuse of questions. One goal of the Advanced Distributed Learning (ADL) Initiative
(viz.,  www.adlnet.org)  is  to  develop  learning  content  as  component  units,  called  learning  objects, 
that  can  be  tagged,  coded  in  a  repository,  identified,  and  reused  or  repurposed  for  other 
instructional applications. The SCORM, a reference model based on emerging e-learning
industry  standard  specifications,  permits  interoperability  of  learning  content  and  learning 
management  systems.  The  model  includes  a  specification,  developed  by  the  IMS  Global 
Learning  Consortium,  for  question  and  test  interoperability  (QTI).  The  specification  defines  a 
standard  format  that  allows  interoperability  for  questions  and  tests  between  different  computer 
systems.  Computer  software  that  supports  QTI  will  allow  export  into  and  import  from  this 
format,  so  that  if  you  computerize  questions  or  tests  on  one  system,  then  the  material  will  also  be 
usable  on  another  system.  This  allows  exchange  of  questions  between  learning  management 
systems,  content  authors,  content  libraries,  and  other  forms  of  question  pools. 

The  specification  does  not  limit  product  designs  by  prescribing  certain  user  interfaces, 
pedagogical  paradigms,  or  policies  that  constrain  innovative  use  of  question-generation  learning 
strategies.  The  QTI  specification  does  afford  the  opportunity  to  capture  in  a  standard  format  the 
questions  generated  during  the  course  of  distributed  learning.  Once  in  this  format,  they  may  find 
multiple  uses  to  enhance  a  learning  experience  as  described  elsewhere  in  this  report. 


Summary 

This  report  provides  a  rationale  for  question  generation  as  a  viable  learning  multiplier  in 
training  environments.  The  rationale  was  derived  from  a  thorough  review  of  recent  research  on 
questioning  from  multiple  perspectives:  psychology,  information  systems  design,  cognitive 
science,  and  computational  linguistics.  Based  on  this  review,  nine  practices  were  identified  for 
immediate  use  in  both  classroom  and  distributed  learning  settings.  Furthermore,  an  application 
area  that  builds  from  a  base  of  frequently  asked  questions  during  training  (or  even  those  posed  at 
the  workplace)  was  identified  as  a  key  focus  for  continuing  experimentation  and  development. 

If employed properly, question generation strategies in DL can increase a soldier’s depth
of understanding about the workings of a complex system. This is beneficial not only in terms of
higher immediate learning outcomes but also in terms of improved retention, as degree of original
learning is the best single predictor of knowledge/skill retention (Wisher, Sabol, and Ellis, 1999).
The  advantages  of  question  generation  in  training  are  particularly  important  to  the  Army.  This  is 
especially  so  as  the  Army  transitions  to  the  complexities  of  a  digital  battleforce  while 
transforming  its  training  delivery  to  a  soldier-centric  DL  mode.  A  similar  argument  was  made 
separately  for  the  potential  of  collaborative  learning  tools  in  soldier-centric  settings  (Bonk  and 
Wisher,  2000). 

The  provisions  for  answering  the  questions  students  pose  in  DL  settings,  particularly 
asynchronous DL, are not clear at this time. What is known is that the deeper the question and
the more thoughtful the response, the better the learning. The potential is great. Rather than
being  an  afterthought  to  future  training  systems,  a  question  generation  mechanism  should  be 
investigated  as  an  essential  feature. 
APPENDIX A


Question  Taxonomy  Proposed  by  Graesser  and  Person  (1994). 

The table below presents the 18 question categories that are based on question content
and that were sincere information-seeking (SIS) questions.


QUESTION CATEGORY            GENERIC QUESTION FRAMES AND EXAMPLES

1. Verification              Is X true or false? Did an event occur? Does a
                             state exist?

2. Disjunctive               Is X, Y, or Z the case?

3. Concept completion        Who? What? When? Where?

4. Feature specification     What qualitative properties does entity X have?

5. Quantification            What is the value of a quantitative variable? How
                             much? How many?

6. Definition questions      What does X mean?

7. Example questions         What is an example or instance of a category?

8. Comparison                How is X similar to Y? How is X different from Y?

9. Interpretation            What concept or claim can be inferred from a static
                             or active pattern of data?

10. Causal antecedent        What state or event causally led to an event or state?
                             Why did an event occur? Why does a state exist?
                             How did an event occur? How did a state come to
                             exist?

11. Causal consequence       What are the consequences of an event or state?
                             What if X occurred? What if X did not occur?

12. Goal orientation         What are the motives or goals behind an agent’s
                             action? Why did an agent do some action?

13. Instrumental/procedural  What plan or instrument allows an agent to
                             accomplish a goal? How did agent do some action?

14. Enablement               What object or resource allows an agent to
                             accomplish a goal?

15. Expectation              Why did some expected event not occur?
                             Why does some expected state not exist?

16. Judgmental               What value does the answerer place on an idea or
                             advice? What do you think of X? How would you
                             rate X?

17. Assertion                A declarative statement that indicates the speaker
                             does not understand an idea.

18. Request/Directive        The questioner wants the listener to perform some
                             action.
APPENDIX B


TRADOC  Guidelines  for  Preparing  Multiple  Choice  Test  Items 

Lessons  Learned  for  All  Test  Items 

•  Write  the  items  in  preliminary  form  during  the  instructional  system  development  period. 

•  Use  a  test  blueprint  or  outline  to  keep  an  appropriate  relationship  between  the  items  on  the  test 
and  the  instructional  objectives. 

•  Base  each  test  item  on  an  important  point,  idea,  or  skill. 

•  Write  items  to  measure  understanding  or  ability  to  apply  principles. 

•  Test  one,  and  only  one,  point  or  idea  per  test  item. 

•  Write  items  that  require  specific  knowledge  of  material  studied,  not  items  that  require  general 
knowledge  or  experience. 

•  Use  clear  and  concise  language  that  is  appropriate  for  the  conceptual  difficulty  level  of  the 
specific  objective  being  tested. 

•  Present  each  test  item  task  as  simply  and  straightforwardly  as  possible. 

•  Keep  test  items  free  of  extraneous,  ambiguous,  or  confusing  material. 

•  Keep  test  items  free  of  tricky  expressions,  slang,  or  other  tricky  requirements. 

•  Review  test  items  from  other  sources,  such  as  textbooks,  and  other  instructors. 

•  Use  original  language,  not  that  found  in  textbooks  or  other  instructional  materials  for  the  course. 

•  Eliminate  any  clues  within  the  test  item,  or  clues  that  relate  to  other  items  in  the  test. 

•  Be especially sensitive to clues or suggestions that could help a naive examinee (one who does
not have the knowledge or skill needed to answer the item correctly).

•  For  tests  that  measure  discrimination,  concrete  concept,  or  defined  concept  intellectual  skills, 
make  each  test  item  independent  of  other  items.  Ensure  that  the  answer  to  one  test  item  is  not 
dependent  on  the  answer  to  other  test  items.  (Some  tests  that  measure  rule  learning  or  problem 
solving  intellectual  skills,  verbal  information  skills,  cognitive  strategy  skills,  or  the  memorization 
component  of  psychomotor  skills  may  be  designed  to  require  correct  answers  to  a  sequence  of 
test items. These tests require that the correct answer to one, or a series of test items, is
dependent on the correct answers to previous test items.)

•  Ensure  that  test  items  are  reviewed  by  other  instructors  and  content  specialists  to  help  eliminate 
ambiguity,  technical  errors,  or  other  errors  in  the  test  item. 

•  Ensure test items are reviewed by individuals who are not content specialists for ambiguity, clues
for naive examinees, and, for selected-response test items, plausibility for the naive examinee.

•  Ensure  that  the  test  item  has  “face  validity”,  measures  a  specific  objective,  and  relates  to  the 
content  studied  in  the  course  of  instruction. 

•  Avoid  the  appearance  of  bias  in  the  test  item  (e.g.,  race,  gender,  cultural,  ethnic,  regional, 
handicapped,  age-group,  or  other  apparent  bias). 

•  Construct  test  items  that  have  a  clearly  correct  or  clearly  best  answer. 

•  Follow  standard  rules  of  punctuation  and  grammar. 

•  For  test  items  based  on  an  opinion  or  authority,  state  whose  opinion  or  what  authority. 

•  Do not require unnecessarily exact or difficult operations. Test items should match objective
criterion  standards. 

•  Do  not  use  specific  determiners  such  as  “always”,  “never”,  “none”,  and  “all”  in  test  items. 

•  Restrict  the  number  of  different  item  formats  in  a  test.  Use  the  most  valid  formats.  Group  items 
in  the  same  format  together. 

•  Use  scenarios,  pictorial  material,  or  other  graphics  only  when  they  are  relevant  to  an  objective  or 
topic  measured  by  the  test  item,  and  only  when  required  for  the  test  item  to  effectively  measure 
an intellectual or psychomotor skill.

•  When  a  scenario,  picture,  or  graphic  is  used,  provide  specific  test  item  directions  referring  to  it. 

•  If  scenarios  are  used  for  a  test  item,  ensure  that  they  are  realistic  and  appropriate  for  the  test  item. 

•  If  pictorial  material  or  other  graphics  are  used  for  a  test  item,  ensure  that  they  are  clearly  drawn 
and  labeled. 

Lessons  Learned  for  Multiple-Choice  Test  Items 

•  Write  the  stem  so  it  clearly  defines  the  test  item  task.  Word  the  stem  so  that  the  examinee  knows 
what  is  required  without  seeing  the  response  options.  Generally,  writing  the  stem  as  a  question 
helps  to  set  the  test  item  task  more  clearly. 

•  When the stem is written as an incomplete statement, the option statements should complete the
sentence,  rather  than  beginning  the  item  stem,  or  being  inserted  in  the  middle  of  the  item  stem. 

•  Reduce  the  “reading  load”  as  much  as  possible.  Avoid  repeating  words  in  the  option  statements, 
by  placing  these  words  in  the  stem. 

•  Do  not  provide  verbal  clues  that  point  to  the  correct  option  or  to  elimination  of  incorrect  option(s), 
such  as  disagreement  between  singular  or  plural,  “a”  and  “an”,  etc. 

•  Make  all  option  statements  fit  or  match  the  stem. 

•  Have  one,  and  only  one,  correct  option. 

•  Make  the  options  approximately  equal  in  length.  Avoid  the  tendency  to  make  the  correct  option 
more  detailed. 

•  Make  the  options  logically  parallel,  and  about  equal  in  complexity. 

•  Make  the  options  grammatically  and  syntactically  parallel.  Use  grammatically  and  syntactically 
parallel  words  in  the  stem,  the  distracters,  and  the  correct  option. 

•  Avoid  using  modifiers  such  as  “sometimes”  and  “usually”  in  the  options. 

•  Ensure  that  each  option  has  a  unique  meaning.  Eliminate  distracters  with  the  same  or  similar 
meanings  from  the  test  item. 

•  Make  all  distracters  plausible  to  a  naive  examinee.  Do  not  include  implausible  or  impossible 
options  as  distracters  in  a  test  item. 

•  Arrange  the  options  in  some  appropriate,  logical  order. 

•  Vary  the  position  of  the  correct  option. 

•  Avoid  using  “all  of  the  above”  as  an  option,  and  use  “none  of  the  above”  sparingly. 

•  Avoid using negative words, including “except”, in the stem and in the options. If it is necessary
to  use  negative  words,  underline,  capitalize,  or  highlight  them  for  emphasis  and  examinee 
visibility. 
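
A few of the multiple-choice guidelines above lend themselves to mechanical checking. The Python sketch below is one hypothetical way to flag them. It covers only four rules (no "all of the above", unique options, no vague modifiers in options, and emphasized negative words in the stem) plus a length heuristic for the correct option. The function name lint_item, the 1.5 length ratio, and the treatment of all-capital letters as "emphasis" are assumptions made for this example and are not part of the TRADOC guidelines. Most guidelines, such as plausibility of distracters, still require human review.

```python
import re
from typing import List

VAGUE_MODIFIERS = {"sometimes", "usually"}
NEGATIVE_WORDS = {"not", "except"}
LENGTH_RATIO = 1.5  # assumed threshold for "noticeably longer"; not from the guidelines


def _words(text: str) -> List[str]:
    """Split text into alphabetic words, preserving case."""
    return re.findall(r"[A-Za-z]+", text)


def lint_item(stem: str, options: List[str], correct_index: int) -> List[str]:
    """Return warnings for guideline violations that can be detected mechanically."""
    warnings = []

    # Lowercase and strip punctuation so comparisons ignore surface differences.
    normalized = [" ".join(_words(o.lower())) for o in options]

    if "all of the above" in normalized:
        warnings.append('avoid "all of the above" as an option')

    # Only catches identical wording; options with similar meanings need a human.
    if len(set(normalized)) < len(normalized):
        warnings.append("options are not unique")

    for opt, norm in zip(options, normalized):
        if VAGUE_MODIFIERS & set(norm.split()):
            warnings.append(f'vague modifier in option: "{opt}"')

    # Negative words are acceptable only when emphasized (here: written in capitals).
    for word in _words(stem):
        if word.lower() in NEGATIVE_WORDS and not word.isupper():
            warnings.append(f'negative word "{word}" in stem is not emphasized')

    # Heuristic: flag a correct option much longer than the average distracter.
    distracters = [o for i, o in enumerate(options) if i != correct_index]
    if distracters:
        mean_len = sum(len(d) for d in distracters) / len(distracters)
        if len(options[correct_index]) > LENGTH_RATIO * mean_len:
            warnings.append("correct option is much longer than the distracters")

    return warnings


# Example: this item triggers the "all of the above", vague-modifier,
# unemphasized-negative, and length warnings.
problems = lint_item(
    "Which item is not a color?",
    ["Red", "Blue", "All of the above",
     "A chair, which is usually a piece of furniture"],
    correct_index=3,
)
```

A checker of this kind could be run over a question pool before human review, so that reviewers spend their time on the judgments that cannot be automated.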